Distributed Systems Engineer — Build Databases & Consensus From Scratch

"What I cannot create, I do not understand." — Richard Feynman

A lab-based curriculum for becoming a senior distributed systems engineer by building the systems you'll one day operate, debug, and replace: LevelDB (LSM-tree storage), SQLite (B-tree storage + SQL), and the three canonical consensus algorithms — Raft, Paxos, and ZAB — all implemented from scratch in Rust, Go, and C++.

Why This Repo Exists

Most engineers treat databases and consensus as black boxes. This curriculum makes them transparent. You will:

  • Write storage engines that flush, compact, recover, and serve concurrent reads.
  • Implement consensus protocols that survive node crashes, network partitions, and message reordering.
  • Reason about hardware trade-offs: SSD vs HDD seek latency, write amplification, fsync cost, io_uring vs blocking I/O, cache-line locality, NUMA effects.
  • Compare algorithm families: LSM vs B-tree, level-based vs size-tiered compaction, Raft vs Multi-Paxos vs ZAB.
  • Build the same thing three times — once in each language — to internalize the design (not the syntax).

Curriculum at a Glance

| Phase | Theme | Labs |
|---|---|---|
| 1 | Storage Primitives & Foundations | db-01 to db-04 |
| 2 | LevelDB / LSM-Tree | db-05 to db-09 |
| 3 | SQLite / B-Tree | db-10 to db-15 |
| 4 | Consensus Algorithms | db-16 to db-20 |
| 5 | Advanced Storage & Capstone | db-21 to db-23 |

See PHASES.md for the full breakdown with learning objectives per lab.

How To Use This Repo

  1. Read TOOLS.md and install the required toolchains (Rust, Go, C++/CMake).
  2. Start with db-01-storage-primitives/. Each lab is self-contained and has the same shape:
    db-NN-<name>/
    ├── CONCEPTS.md       # The "why" — read this first
    ├── references.md     # Papers and source-code links to study
    ├── docs/
    │   ├── analysis.md       # Design trade-offs (hardware, algorithmic)
    │   ├── broader-ideas.md  # Extensions, alternatives, future work
    │   ├── execution.md      # Toolchain versions, quick-start commands
    │   ├── observation.md    # Debugging, profiling, monitoring
    │   └── verification.md   # Pass/fail checks for your implementation
    ├── steps/            # Numbered, sequential implementation guides
    │   ├── 01-*.md
    │   └── 02-*.md
    └── src/
        ├── rust/         # Cargo workspace
        ├── go/           # Go module
        └── cpp/          # CMake project
    
  3. Work through steps/ in order. The reference code in src/ is a target — try to write your own first, then compare.
  4. Run the checks in docs/verification.md before moving on.

What You Will Build

By the end of the curriculum you will have implemented (×3 languages):

  • A crash-safe write-ahead log with CRC32 checksums and group commit.
  • A skip-list MemTable, an SSTable file format with block compression, and level-based compaction.
  • A page-oriented B+-tree with a pager, rollback journal, and WAL mode.
  • A hand-written SQL tokenizer, parser, AST, and bytecode virtual machine.
  • A transaction manager with MVCC snapshot reads and serializable writes.
  • A complete Raft implementation with snapshotting and membership changes.
  • Single-decree Paxos and Multi-Paxos with a stable leader.
  • A simplified ZAB broadcast layer with epoch transitions.
  • A 3-node distributed KV store combining Raft with your LevelDB clone.
  • A capstone mini distributed SQL database (the storage engine, the SQL frontend, and Raft replication — all your own code).

Prerequisites

  • Comfortable with C-family syntax in at least one systems language (you'll pick up the other two as you go).
  • Familiarity with binary trees, hash tables, and Big-O analysis.
  • Basic Linux command-line and git.
  • Not required: prior distributed systems knowledge, SQL internals knowledge, or database engine experience. We build it all from the ground up.

Pedagogical Style

Modeled after cstack/db_tutorial (concept-first, incremental, runnable code at every step) and the ai-engineering/ lab repo (consistent 8-part CONCEPTS.md, docs/, steps/, src/ structure).

Every CONCEPTS.md follows the same 8-part framework:

  1. What Is It — one-paragraph executive summary
  2. Why It Matters — concrete benefits
  3. How It Works — ASCII architecture diagram
  4. Core Terminology — table of precise definitions
  5. Mental Models — analogies for intuition
  6. Common Misconceptions — myths corrected
  7. Interview Talking Points — what to say in a senior systems interview
  8. Connections to Other Labs — how this fits the bigger picture

Status

| Phase | Status |
|---|---|
| Phase 1: Storage Primitives | Lab 01 complete, 02–04 scaffolded |
| Phase 2: LevelDB | Scaffolded |
| Phase 3: SQLite | Scaffolded |
| Phase 4: Consensus | Scaffolded |
| Phase 5: Advanced & Capstone | Scaffolded |

See PHASES.md for per-lab status.

License

MIT — see source headers in each implementation.

Phases & Labs

This curriculum has 5 phases and 23 labs. Phases build on each other, but within Phase 4 (consensus) you can do Raft → Paxos → ZAB in any order after the foundations in db-16.

Legend: ✅ complete · 🟡 scaffolded · ⬜ planned


Phase 1 — Storage Primitives & Foundations

Before you can build a database, you need to understand the medium it lives on.

| Lab | Title | Status | Key Concepts |
|---|---|---|---|
| db-01 | Storage Primitives | ✅ | Pages, byte order, mmap vs pread, alignment, HDD/SSD/NVMe latency |
| db-02 | Data Structures for Storage | 🟡 | Skip lists, hash tables, when in-memory vs on-disk structures differ |
| db-03 | Write-Ahead Log | 🟡 | WAL framing, CRC32, fsync semantics, group commit |
| db-04 | Bloom Filters & Hashing | 🟡 | FPR math, xxHash vs Murmur, cuckoo & xor filter alternatives |

Phase 2 — LevelDB / LSM-Tree

Build a production-shape LSM-tree key-value store, the way Google built LevelDB and Meta forked it into RocksDB.

| Lab | Title | Status | Key Concepts |
|---|---|---|---|
| db-05 | LSM MemTable | 🟡 | Skip-list MemTable, immutable MemTable, flush trigger |
| db-06 | SSTable Format | 🟡 | Data/index/filter blocks, restart points, footer |
| db-07 | LSM Compaction | 🟡 | Level vs size-tiered vs universal, write amplification |
| db-08 | Block Cache & Iterators | 🟡 | LRU, MergingIterator, snapshot via sequence numbers |
| db-09 | LevelDB Complete | 🟡 | Open/close, WriteBatch, recovery, YCSB benchmark |

Phase 3 — SQLite / B-Tree

Build a B+-tree storage engine, a pager, a SQL parser, a bytecode VM, and a transaction manager.

| Lab | Title | Status | Key Concepts |
|---|---|---|---|
| db-10 | B-Tree Fundamentals | 🟡 | B-Tree vs B+-Tree, page layout, splits & merges |
| db-11 | Pager System | 🟡 | Page cache, rollback journal vs WAL mode, checkpointing |
| db-12 | SQL Frontend | 🟡 | Tokenizer, parser, AST, VDBE bytecode VM |
| db-13 | Transactions & MVCC | 🟡 | ACID, isolation levels, SQLite locks, MVCC vs 2PL |
| db-14 | Indexes & Query Planning | 🟡 | Secondary indexes, cost-based planner, ART, BRIN |
| db-15 | SQLite Complete | 🟡 | JOINs, aggregation, TPC-H subset benchmark |

Phase 4 — Consensus Algorithms

The three canonical consensus families — implemented, tested, and compared.

| Lab | Title | Status | Key Concepts |
|---|---|---|---|
| db-16 | Distributed Fundamentals | 🟡 | CAP, FLP, linearizability, vector clocks, HLC |
| db-17 | Raft | 🟡 | Election, AppendEntries, snapshotting, ReadIndex |
| db-18 | Paxos | 🟡 | Single-decree, Multi-Paxos, Flexible Paxos |
| db-19 | ZAB | 🟡 | Epochs, zxids, primary-backup vs leader-based |
| db-20 | Distributed KV Store | 🟡 | Raft + LevelDB backend, linearizable reads, sharding |

Phase 5 — Advanced Storage & Capstone

| Lab | Title | Status | Key Concepts |
|---|---|---|---|
| db-21 | Advanced Storage | 🟡 | io_uring, O_DIRECT, columnar layout, WiscKey |
| db-22 | Performance & Benchmarking | 🟡 | YCSB A–F, flamegraphs, NUMA, perf counters |
| db-23 | Capstone Distributed DB | 🟡 | SQL → planner → LevelDB → Raft; 2PC over Raft groups |

Suggested Pace

  • Full-time learner: ~2 labs per week ⇒ ~12 weeks end-to-end.
  • Side-project learner: ~1 lab every 1–2 weeks ⇒ ~6–11 months for all 23 labs.
  • Reading-only path: skim CONCEPTS.md + docs/analysis.md per lab ⇒ ~1 week for the entire curriculum.

Phase 1 (must do all 4 in order)
   │
   ├─→ Phase 2 (LevelDB)  ──┐
   │                        │
   └─→ Phase 3 (SQLite) ────┤
                            ↓
                         Phase 4 (Consensus)
                            ↓
                         Phase 5 (Capstone)

Phase 2 and Phase 3 are independent — pick the storage style that excites you first. Phase 4 only references Phase 1 fundamentals, so you can detour into consensus early if you want. Phase 5's capstone assumes all four prior phases.

Glossary

A unified glossary of terms used across all labs. Terms are grouped by domain.

Storage & I/O

| Term | Definition |
|---|---|
| Page | The unit of I/O between disk and memory. Usually 4 KiB (matches OS page size) but databases often use 4–32 KiB. |
| Block | An SSTable's I/O unit (LevelDB default 4 KiB). Distinct from a B-tree "page": both are I/O units but for different engines. |
| mmap | Map a file into process address space. Reads happen via page faults; writes via dirty pages flushed by the kernel. |
| pread/pwrite | Positional read/write syscalls. Explicit offset, no shared file pointer. Predictable cost, no page-fault stalls. |
| O_DIRECT | Open flag (Linux) that bypasses the page cache. Requires aligned buffers, aligned offsets, aligned sizes. |
| fsync | Force file data + metadata to stable storage. Blocks until disk acknowledges. Often the slowest syscall in a database. |
| fdatasync | Like fsync but skips non-essential metadata. Faster on most filesystems. |
| Write amplification (WA) | Bytes physically written / bytes logically written. SSDs have hardware WA; LSM-trees have algorithmic WA from compaction. |
| Read amplification (RA) | Bytes physically read / bytes logically read. LSM-trees suffer from RA due to checking multiple levels. |
| Space amplification | Bytes on disk / bytes of live data. LSMs have space amp from stale data awaiting compaction. |
| Endianness | Byte order. Little-endian (x86, ARM default): least-significant byte first. Big-endian: network byte order. |
| Alignment | Memory address being a multiple of N. Required for O_DIRECT (usually 512 B or 4 KiB) and SIMD ops. |
| io_uring | Linux async I/O API (≥ 5.1). Two ring buffers (SQ/CQ) shared between kernel and user space. |
| DMA | Direct Memory Access: the disk controller writes directly to RAM without CPU involvement. |

Hardware

| Term | Definition |
|---|---|
| HDD seek time | ~5–10 ms for random reads (head movement + rotational latency). ~150 MB/s sequential. |
| SATA SSD | ~100 µs random read latency, ~500 MB/s sequential, ~80K IOPS. |
| NVMe SSD | ~50–100 µs random read latency, ~3–7 GB/s sequential, ~500K–1M IOPS. Multiple hardware queues. |
| Cache line | CPU cache unit, almost always 64 bytes. Data-structure layout for cache locality matters. |
| NUMA | Non-Uniform Memory Access: CPU sockets have local RAM; cross-socket access is slower. |
| Wear leveling | SSD firmware spreads writes across blocks to even out flash wear. Contributes to hardware write amplification. |

Data Structures

| Term | Definition |
|---|---|
| Skip list | Probabilistically balanced structure with expected O(log n) ops and lock-free-friendly properties. Used in LevelDB MemTable. |
| B-Tree | Self-balancing m-ary tree. Internal nodes store keys + values + child pointers. Used for indexes. |
| B+-Tree | B-Tree variant where all values live in leaf nodes; internal nodes are pure routing. Used for tables in SQLite. |
| LSM-Tree | Log-Structured Merge-Tree. In-memory MemTable + on-disk sorted runs (SSTables), merged via compaction. |
| Bloom filter | Probabilistic set membership; no false negatives, tunable false positive rate. Used to skip SSTable lookups. |
| ART | Adaptive Radix Tree: modern in-memory index alternative to B-Trees, used by HyPer, DuckDB. |
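
The "tunable false positive rate" of a Bloom filter follows a standard approximation, p ≈ (1 − e^(−kn/m))^k for m bits, n keys, and k hash functions. The sketch below only evaluates that formula; the function names `bloom_fpr` and `optimal_k` are hypothetical and not part of any lab's reference code.

```rust
/// Approximate false-positive rate of a Bloom filter with `m_bits` bits,
/// `n_keys` inserted keys, and `k` hash functions: (1 - e^(-k*n/m))^k.
fn bloom_fpr(m_bits: f64, n_keys: f64, k: f64) -> f64 {
    (1.0 - (-k * n_keys / m_bits).exp()).powf(k)
}

/// Number of hash functions that minimizes the rate: k = (m / n) * ln 2.
fn optimal_k(m_bits: f64, n_keys: f64) -> f64 {
    (m_bits / n_keys) * std::f64::consts::LN_2
}

fn main() {
    // 10 bits per key is a common budget; the optimum is close to 7 hashes.
    let (m, n) = (10_000.0, 1_000.0);
    println!("optimal k = {:.2}", optimal_k(m, n));
    println!("FPR at k=7 = {:.4}", bloom_fpr(m, n, 7.0));
}
```

At 10 bits per key the rate lands a little under 1%, which is why that budget shows up so often in LSM-tree configurations.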

Consensus

| Term | Definition |
|---|---|
| Quorum | Subset of nodes whose agreement is required. Typically ⌊N/2⌋ + 1 for majority quorum. |
| Term / Epoch | Monotonically increasing identifier for a leadership period (Raft term, ZAB epoch, Paxos ballot). |
| Log index | Position of an entry in the replicated log. Indices are monotonic and dense. |
| Commit index | The largest log index known to be safely replicated to a quorum. |
| Linearizability | Strongest consistency: operations appear to take effect atomically at some point between their invocation and response. |
| Sequential consistency | All processes agree on a single global order, but the order need not match real-time. |
| Eventual consistency | If updates stop, all replicas eventually agree. No real-time guarantees. |
| CAP theorem | Under a network partition, you must choose Consistency or Availability. Partition tolerance is non-negotiable. |
| FLP impossibility | No deterministic asynchronous consensus protocol can guarantee progress with even one crash failure. |
| Lamport timestamp | Scalar logical clock: L(a) < L(b) if a happened-before b. Cannot detect concurrency. |
| Vector clock | Per-node vector. VC(a) < VC(b) iff every component of a is ≤ the matching component of b and at least one is strictly smaller. Detects concurrent events. |
| HLC | Hybrid Logical Clock: combines physical time with a logical counter; bounded skew from real time. |
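
Two of the definitions above are easy to pin down in code: the majority quorum size and the vector-clock ordering. This is a minimal sketch under the assumption that both clocks are plain slices covering the same set of nodes; `compare_vc` and `majority` are hypothetical names, not functions from the labs.

```rust
use std::cmp::Ordering;

/// Compare two vector clocks that cover the same nodes.
/// Some(Less): `a` happened-before `b`. Some(Greater): `b` happened-before `a`.
/// Some(Equal): identical clocks. None: the events are concurrent.
fn compare_vc(a: &[u64], b: &[u64]) -> Option<Ordering> {
    assert_eq!(a.len(), b.len(), "clocks must cover the same nodes");
    let mut a_smaller = false; // some component of `a` is strictly smaller
    let mut b_smaller = false; // some component of `b` is strictly smaller
    for (x, y) in a.iter().zip(b.iter()) {
        match x.cmp(y) {
            Ordering::Less => a_smaller = true,
            Ordering::Greater => b_smaller = true,
            Ordering::Equal => {}
        }
    }
    match (a_smaller, b_smaller) {
        (false, false) => Some(Ordering::Equal),
        (true, false) => Some(Ordering::Less),
        (false, true) => Some(Ordering::Greater),
        (true, true) => None,
    }
}

/// Majority quorum size for a cluster of `n` nodes: floor(n / 2) + 1.
fn majority(n: usize) -> usize {
    n / 2 + 1
}

fn main() {
    println!("{:?}", compare_vc(&[1, 2, 0], &[2, 2, 1])); // Some(Less)
    println!("{:?}", compare_vc(&[2, 0, 0], &[0, 1, 0])); // None
    println!("{}", majority(5)); // 3
}
```

The `None` case is exactly what a Lamport timestamp cannot express: a scalar always orders two events, even when neither caused the other.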

Transactions

| Term | Definition |
|---|---|
| ACID | Atomicity, Consistency, Isolation, Durability: properties a transaction must satisfy. |
| Isolation level | READ UNCOMMITTED → READ COMMITTED → REPEATABLE READ → SERIALIZABLE. Each rules out more anomalies. |
| Dirty read | Reading data written by an uncommitted transaction. |
| Non-repeatable read | Reading the same row twice in one tx and getting different values. |
| Phantom read | A range query returns different rows when re-run within one tx. |
| MVCC | Multi-Version Concurrency Control: writes create new versions; readers see a snapshot. |
| 2PL | Two-Phase Locking: acquire locks in a growing phase, release in a shrinking phase. Guarantees serializability. |
| 2PC | Two-Phase Commit: distributed transaction protocol with a prepare phase, then commit/abort. Blocking on coordinator failure. |

SQL Engine

| Term | Definition |
|---|---|
| VDBE | Virtual Database Engine: SQLite's bytecode VM that executes compiled SQL. |
| Prepared statement | A parsed and compiled SQL statement, reusable with different parameters. |
| Cardinality estimation | Predicting how many rows a query operator will produce. Core to the query planner. |
| Selectivity | Fraction of rows that satisfy a predicate. Low selectivity (few rows match) ⇒ index scan preferred. |
| Covering index | An index that contains all columns needed by a query, so the table doesn't need to be touched. |

Operational

| Term | Definition |
|---|---|
| Snapshot | A consistent point-in-time view of data. Used for backups, MVCC reads, Raft log compaction. |
| Checkpoint | Operation that flushes in-memory state to disk so recovery has less log to replay. |
| Compaction | Background process that merges sorted files (LSM) or reclaims fragmented space (B-tree). |
| YCSB | Yahoo Cloud Serving Benchmark: standard KV workload suite (A–F). Used in db-22. |
| Jepsen | Test framework for distributed systems correctness; injects partitions/clock skew. Inspires our consensus tests. |

Toolchain Setup

All labs target Linux-first with macOS as a supported secondary platform. Windows is not supported (no io_uring, no O_DIRECT semantics we rely on; use WSL2 instead).

Required Versions

| Tool | Minimum | Recommended | Why |
|---|---|---|---|
| Rust | 1.78 | 1.82+ | std::io::IoSlice, stabilized OnceLock, edition 2021 features used throughout |
| Go | 1.22 | 1.23+ | range-over-func iterators (stable from 1.23, experimental in 1.22), improved slices/maps stdlib, generics maturity |
| C++ | C++20 | C++20 (Clang 16+ / GCC 13+) | Concepts, <bit> for endian ops, std::span, designated initializers |
| CMake | 3.28 | 3.29+ | C++20 module support (CXX_MODULES file sets), modern target_link_libraries semantics |
| clang-format | 17 | 18+ | Consistent C++ formatting across labs |
| Python | 3.11 | 3.12+ | Benchmark plotting & verification scripts (matplotlib, pandas) |

Per-Language Setup

Rust

# rustup is the canonical installer.
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup default stable
rustup component add clippy rustfmt
cargo install cargo-nextest        # faster, parallel test runner
cargo install cargo-flamegraph     # used in db-22

Verify:

rustc --version    # rustc 1.78.0 or newer
cargo --version

Go

# macOS
brew install go
# Linux: download from https://go.dev/dl/ — distro packages are usually old.

# Useful tools
go install honnef.co/go/tools/cmd/staticcheck@latest
go install golang.org/x/perf/cmd/benchstat@latest

Verify:

go version    # go1.22 or newer

C++

# macOS
xcode-select --install
brew install cmake ninja llvm

# Linux (Debian/Ubuntu)
sudo apt-get install -y build-essential cmake ninja-build clang-17 clang-format-17 \
                        libsnappy-dev liburing-dev

Verify:

clang++ --version   # Clang 16 or newer
cmake --version     # 3.28 or newer

Optional but recommended:

  • liburing-dev — required only for db-21 (io_uring lab) on Linux.
  • libsnappy-dev — used in db-06 (SSTable block compression).
  • valgrind / lldb — for memory and crash debugging.

Per-Lab Build Commands

Every lab src/<lang>/ is self-contained and has these commands:

| Language | Build | Test | Run |
|---|---|---|---|
| Rust | cargo build --release | cargo nextest run (or cargo test) | cargo run --release --bin <name> |
| Go | go build ./... | go test ./... | go run ./cmd/<name> |
| C++ | cmake -B build -G Ninja && cmake --build build | ctest --test-dir build | ./build/<name> |

docs/execution.md in each lab repeats the exact commands with the lab-specific binary names.

OS-Specific Notes

Linux

  • io_uring requires kernel ≥ 5.1 (≥ 5.6 for most useful features). Check with uname -r.
  • O_DIRECT works on most filesystems but is rejected by tmpfs — use a real disk path in tests.
  • For accurate latency benchmarks, disable CPU frequency scaling: sudo cpupower frequency-set -g performance.

macOS

  • No io_uring: db-21 falls back to kqueue + worker pool. The lab explains the difference.
  • O_DIRECT does not exist; use F_NOCACHE via fcntl (the lab provides the wrapper).
  • fsync(2) does not guarantee data hits stable storage on macOS — use fcntl(F_FULLFSYNC). Labs handle this.

Editor / IDE

Any editor works. VS Code with these extensions is what the reference implementations were written in:

  • rust-lang.rust-analyzer
  • golang.go
  • llvm-vs-code-extensions.vscode-clangd
  • ms-vscode.cmake-tools

Sanity Check Script

Run this once after setup to verify everything works:

cd db-01-storage-primitives
( cd src/rust && cargo build --release ) && \
( cd src/go && go build ./... ) && \
( cd src/cpp && cmake -B build -G Ninja && cmake --build build ) && \
echo "All three toolchains OK."

Storage Primitives

The lab that earns you the right to talk about databases.

1. What Is It

This lab teaches the physical layer that every storage engine sits on top of: how data moves between a process's memory and a block device. You will learn the OS page model, the byte-order question (endianness), the four main file I/O styles (read/write, pread/pwrite, mmap, O_DIRECT), buffer alignment, and the durability primitive fsync. You'll also internalize the latency numbers for HDD, SATA SSD, and NVMe SSD that drive every storage design decision in the rest of the curriculum. The deliverable is a tiny page allocator plus a hexdump utility, written three times (once in Rust, once in Go, and once in C++), exercising pread/pwrite against a real disk file.

2. Why It Matters

  • Every later lab depends on these primitives. LSM-trees, B-trees, WALs — they're all built on pread/pwrite/fsync and an understanding of the page cache.
  • Choosing the right I/O style changes throughput by 10–100×. A naïve read loop is not the same as pread from many threads, which is not the same as io_uring, which is not the same as mmap.
  • Hardware shapes the algorithm. LSM-trees exist because random writes on HDDs were catastrophic. NVMe IOPS now make some classic assumptions wrong. Knowing the numbers prevents cargo-culting designs from the wrong decade.
  • fsync is the single most expensive syscall in any database. Understanding when it must be called — and when you can amortize it — is the difference between 100 commits/sec and 100,000 commits/sec.

3. How It Works

                  User process
   ┌───────────────────────────────────────────────────┐
   │   Your code: page_allocator, db.put("key", val)   │
   │           buffer = [u8; PAGE_SIZE]                │
   └────────────┬───────────────────┬──────────────────┘
                │                   │
                │  pread/pwrite     │  mmap
                │  (explicit copy)  │  (page-fault driven)
                ▼                   ▼
        ┌─────────────────────────────────────┐
        │       Kernel page cache (RAM)       │  ← cached pages,
        │   4 KiB pages, indexed by inode+off │     LRU-evicted
        └────────────────┬────────────────────┘
                         │  block layer
                         │  (scheduler, mq-deadline / none for NVMe)
                         ▼
                ┌─────────────────────┐
                │  Device driver      │  fsync() blocks here
                │  (NVMe / SATA AHCI) │  until disk acks
                └─────────┬───────────┘
                          ▼
                ┌─────────────────────┐
                │  Storage hardware   │  HDD:  ~5 ms  random
                │  HDD / SSD / NVMe   │  SSD:  ~100 µs random
                │                     │  NVMe: ~50  µs random
                └─────────────────────┘

Three things to internalize from this picture:

  1. Without O_DIRECT, you always go through the kernel page cache. Your pread may hit a warm cache (memcpy speed) or cold cache (full disk I/O). Latency variance is enormous.
  2. fsync (or fdatasync, or opening the file with O_SYNC/O_DSYNC) is how you make the kernel push dirty data to the device and make the device flush its own write cache. Without it, "the write returned" means "the kernel accepted it", not "it survives a power loss".
  3. mmap and pread are fundamentally different mental models. mmap makes I/O implicit (page faults), pread makes it explicit (syscalls). LMDB chose mmap. SQLite, LevelDB, and PostgreSQL chose pread/pwrite. We will use pread/pwrite for predictability, and discuss mmap in the analysis.
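
To make the explicit model concrete, here is a minimal sketch of one page going out through pwrite and coming back through pread. It assumes a Unix target, where Rust's `FileExt::write_all_at` and `read_exact_at` wrap those syscalls; the helper name `page_round_trip` and the scratch file name are hypothetical.

```rust
use std::fs::OpenOptions;
use std::io;
use std::os::unix::fs::FileExt; // write_all_at / read_exact_at wrap pwrite / pread
use std::path::Path;

const PAGE_SIZE: usize = 4096;

/// Write one page filled with `fill` at page number `page_no`, then read it back.
/// Both calls carry their own offset, so no shared file cursor is involved.
fn page_round_trip(path: &Path, page_no: u64, fill: u8) -> io::Result<[u8; PAGE_SIZE]> {
    let file = OpenOptions::new().read(true).write(true).create(true).open(path)?;
    let offset = page_no * PAGE_SIZE as u64;

    let page = [fill; PAGE_SIZE];
    file.write_all_at(&page, offset)?; // repeats pwrite until the whole page is accepted

    let mut out = [0u8; PAGE_SIZE];
    file.read_exact_at(&mut out, offset)?; // most likely served from the page cache
    Ok(out)
}

fn main() -> io::Result<()> {
    // Hypothetical scratch file in the system temp directory.
    let path = std::env::temp_dir().join("db01_round_trip.bin");
    let page = page_round_trip(&path, 3, 0xAB)?;
    println!("read back {} bytes, first byte {:#04x}", page.len(), page[0]);
    std::fs::remove_file(&path)
}
```

Nothing here is durable yet: the read succeeds because the page cache already holds the bytes, which is point 1 above in action.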

4. Core Terminology

| Term | Definition |
|---|---|
| Page | Fixed-size unit of I/O between user and storage. The kernel uses 4 KiB; databases pick 4–32 KiB. We use 4 KiB. |
| Page cache | Kernel-managed RAM that mirrors recently accessed file pages. Transparent to read/write and pread/pwrite. |
| pread(fd, buf, n, off) | Read n bytes from fd starting at byte offset off. Does not affect the file pointer. Thread-safe. |
| pwrite(fd, buf, n, off) | Write n bytes to fd at byte offset off. Thread-safe. |
| mmap | Map a file region into the process's address space. Accesses become loads/stores; faults trigger page-ins. |
| fsync(fd) | Block until all dirty data and metadata for fd are on stable storage. The durability primitive. |
| fdatasync(fd) | Like fsync but may skip metadata updates that aren't required to retrieve the data. |
| O_DIRECT | Open flag (Linux) that bypasses the page cache. Requires 512-byte or 4-KiB alignment on buffers, offsets, sizes. |
| F_FULLFSYNC | macOS-only fcntl that actually flushes the drive's cache. fsync on macOS is not enough for true durability. |
| Endianness | Byte order of multi-byte integers in memory. Little-endian = LSB first (x86, ARM default); big-endian = MSB first (network byte order). |
| Alignment | An address being a multiple of N. Matters for SIMD, DMA, O_DIRECT, and many hardware operations. |
| Sector | The atomic write unit of the device. HDDs: 512 B (legacy) or 4 KiB (Advanced Format). NVMe: usually 4 KiB. |
| IOPS | I/O operations per second. The right unit for random workloads (HDD ~150, SATA SSD ~80K, NVMe ~500K–1M). |
| Latency | Time for one operation to complete. Often what users actually feel; throughput hides tail behavior. |

5. Mental Models

The page cache is a transparent cache, not a database. Think of pread like Map::get: if the key is in the cache, it's a memcpy; if it's not, the kernel goes to disk for you. You can't observe a cache miss with timing alone in production — that's the whole point of caches and the whole reason benchmarks lie.

fsync is a phone call to the disk. All other writes are "I told the postman" — fast, no guarantee. fsync is "I waited on the line while the disk confirmed the package arrived." Phone calls are slow. Group commits = bundling 100 packages into one call.

mmap is "make the file look like an array". pread is "I will ask for bytes one request at a time". The first is convenient. The second is predictable. Convenience and predictability are usually at war.

Sequential vs random I/O on an HDD is 100× different. On NVMe it's 2–3×. This is why LSM-trees won the 2000s and why "just append" got rediscovered in the 2010s and why NVMe makes some of those assumptions less critical in the 2020s. Hardware shapes design.

6. Common Misconceptions

  1. "write returning means my data is safe." False. The kernel buffered it. Only fsync (or fdatasync for data-only, or F_FULLFSYNC on macOS) guarantees durability.
  2. "mmap is faster than pread because there's no syscall." Often false. Touching a non-resident page through mmap triggers a page fault, which also traps into the kernel and is synchronous (you can't overlap it with computation as easily). LMDB-style designs win when the working set fits in RAM; they suffer on durable writes, which need an msync over the mapping.
  3. "SSDs make random vs sequential irrelevant." Partially true. Random reads are fast, but random writes still incur garbage collection and write amplification at the firmware level. Sequential writes still reduce hardware WA significantly.
  4. "4 KiB is always the right page size." No. It matches the OS page size, which is friendly for mmap and for the page cache. But LevelDB pairs 4 KiB blocks (to limit read amplification) with multi-megabyte SSTables (2 MiB by default; RocksDB defaults to 64 MiB) to keep writes sequential. PostgreSQL uses 8 KiB pages. The "right" page size depends on workload.
  5. "fsync flushes only my file." On many filesystems and many older kernels, fsync could flush more (or less) than expected. Modern ext4/xfs are sane, but historical PostgreSQL fsync bugs (2018) showed that the contract is more subtle than it looks.
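
Misconception 1 is the one that loses data, so it is worth seeing the fix in code. The sketch below pairs a positional write with Rust's `File::sync_data`, which issues fdatasync on Linux; treat the behavior on other platforms as an assumption to verify against the standard library docs. The helper name `write_durable` and the file name are hypothetical.

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::os::unix::fs::FileExt;

/// Write `data` at `offset`, then block until the data is reported durable.
fn write_durable(file: &File, data: &[u8], offset: u64) -> io::Result<()> {
    file.write_all_at(data, offset)?; // so far the bytes may live only in the page cache
    file.sync_data() // fdatasync on Linux: returns once the flush is acknowledged
}

fn main() -> io::Result<()> {
    // Hypothetical scratch file; use a path on a real filesystem, not tmpfs.
    let path = std::env::temp_dir().join("db01_durable.bin");
    let file = OpenOptions::new().read(true).write(true).create(true).open(&path)?;

    write_durable(&file, b"commit-1", 0)?;

    let mut back = [0u8; 8];
    file.read_exact_at(&mut back, 0)?;
    println!("{}", String::from_utf8_lossy(&back));
    std::fs::remove_file(&path)
}
```

A commit should be acknowledged to the client only after `write_durable` returns; acknowledging after the write but before the sync is the bug this misconception produces.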

7. Interview Talking Points

  • "For a write-heavy OLTP workload on local NVMe, I'd start with direct pwrite + fdatasync rather than mmap. mmap makes durable writes ambiguous: msync(MS_SYNC) is a heavier hammer than fdatasync because it is usually issued over the whole mapping, and you give up control over write ordering."
  • "My rule of thumb: HDD random read ≈ 5 ms, SATA SSD ≈ 100 µs, NVMe ≈ 50 µs, RAM ≈ 100 ns, L1 ≈ 1 ns. That ladder spans almost seven orders of magnitude, and each jump of 10× or more is where a different design becomes interesting. LSM-trees narrow the gap between random and sequential writes by converting random writes into sequential ones."
  • "fsync cost is the difference between 100 commits/sec and 100,000 commits/sec. Group commit batches N concurrent transactions into one fsync, trading latency (one transaction may wait ~5 ms for a batch) for throughput (with N = 100, 100× more committed transactions per fsync). Postgres and MySQL InnoDB both do this."
  • "O_DIRECT isn't a free win. You skip the page cache, so you have to implement your own cache and your buffers must be aligned. PostgreSQL deliberately uses the page cache and lets the OS do that work for it. Oracle and Sybase use O_DIRECT. The choice depends on whether you trust your buffer manager more than the kernel's."

8. Connections to Other Labs

  • db-02 — uses the page-aligned allocator from here for skip-list and hash-table node storage.
  • db-03 — the WAL is literally pwrite + fdatasync in a loop; this lab gives you the muscle memory.
  • db-06 — SSTable blocks are read via pread at known offsets; this lab is the read side.
  • db-11 — the SQLite pager is a pread/pwrite-based page cache; you'll reimplement what the kernel does for you here.
  • db-21 — revisits I/O with io_uring (Linux) and O_DIRECT for the advanced cases; this lab establishes the baseline.

References — Storage Primitives

Canonical Papers & Specifications

  • POSIX pread/pwrite/fsync — https://pubs.opengroup.org/onlinepubs/9699919799/functions/pread.html
  • Linux open(2) (for O_DIRECT, O_DSYNC) — https://man7.org/linux/man-pages/man2/open.2.html
  • Linux fsync(2) — https://man7.org/linux/man-pages/man2/fsync.2.html
  • Linux io_uring design — https://kernel.dk/io_uring.pdf (Jens Axboe, 2019). Read for db-21.
  • macOS F_FULLFSYNC: see man fcntl on macOS; see also Apple Tech Note TN1150.

Hardware Numbers

  • "Latency Numbers Every Programmer Should Know" — Jeff Dean, 2012. https://gist.github.com/jboner/2841832
  • "What Every Programmer Should Know About Memory" — Ulrich Drepper, 2007. https://people.freebsd.org/~lstewart/articles/cpumemory.pdf (long but seminal)
  • NVMe specification — https://nvmexpress.org/specifications/ (skim §3 on queues, §4 on commands)

Battle Stories

  • "PostgreSQL's fsync surprise" — https://lwn.net/Articles/752063/. Why fsync semantics on Linux were subtler than database authors assumed. Read this.
  • "Files are Hard" — Dan Luu. https://danluu.com/file-consistency/. Survey of how filesystems can lose your data.
  • "mmap-based databases vs. read/write-based databases" — Andy Pavlo et al., "Are You Sure You Want to Use MMAP in Your Database Management System?", CIDR 2022. https://db.cs.cmu.edu/mmap-cidr2022/. Required reading if you ever consider mmap.

Implementation References

  • SQLite OS interface — https://www.sqlite.org/src/file/src/os_unix.c (search for unixSync to see real-world fsync handling, including the macOS F_FULLFSYNC workaround)
  • LevelDB env_posix.cc — https://github.com/google/leveldb/blob/main/util/env_posix.cc (look at PosixWritableFile::Sync)
  • LMDB — http://www.lmdb.tech/doc/ (the canonical mmap database; read for contrast)

Books

  • "Operating Systems: Three Easy Pieces" — Arpaci-Dusseau. Free at https://pages.cs.wisc.edu/~remzi/OSTEP/. Chapters 39–44 (persistence) are exactly this lab.
  • "Designing Data-Intensive Applications" — Martin Kleppmann, O'Reilly. Chapter 3 ("Storage and Retrieval") frames the LSM vs B-tree debate that drives Phases 2 and 3.

Analysis — Storage Primitives

This document covers the design decisions and their trade-offs. CONCEPTS.md told you what exists; this tells you why we picked one option over another and what you'd reach for under different conditions.


Decision 1: pread/pwrite over read/write

We use pread/pwrite (explicit offsets) instead of read/write + lseek.

| Aspect | read/write + lseek | pread/pwrite |
|---|---|---|
| Thread safety on shared fd | Unsafe: file pointer is shared, lseek+read races | Safe: offset is per-call |
| Syscalls per op | 2 (lseek + read) | 1 |
| Mental model | Stateful cursor | Stateless random access |
| Used by | Single-threaded streaming code | All real databases (SQLite, LevelDB, Postgres) |

Verdict: pread/pwrite is strictly better for database-style access patterns. The only reason to use the cursor variant is when you genuinely have a single sequential reader (e.g., tail -f).
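
The thread-safety row is the one that matters most in practice. As a minimal sketch, the code below reads several pages concurrently through one shared file handle, assuming a Unix target where `read_exact_at` wraps pread; `read_pages_concurrently` and the scratch file name are hypothetical.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;
use std::sync::Arc;
use std::thread;

const PAGE_SIZE: usize = 4096;

/// Read pages 0..n_pages concurrently through one shared File.
/// Each thread passes its own offset, so no lock and no lseek are needed.
/// Returns the first byte of every page, in page order.
fn read_pages_concurrently(file: Arc<File>, n_pages: u64) -> io::Result<Vec<u8>> {
    let handles: Vec<_> = (0..n_pages)
        .map(|page_no| {
            let file = Arc::clone(&file);
            thread::spawn(move || -> io::Result<u8> {
                let mut buf = [0u8; PAGE_SIZE];
                file.read_exact_at(&mut buf, page_no * PAGE_SIZE as u64)?;
                Ok(buf[0])
            })
        })
        .collect();

    handles
        .into_iter()
        .map(|h| h.join().expect("reader thread panicked"))
        .collect()
}

fn main() -> io::Result<()> {
    // Hypothetical scratch file: page i is filled with the byte value i.
    let path = std::env::temp_dir().join("db01_concurrent.bin");
    let file = File::options().read(true).write(true).create(true).truncate(true).open(&path)?;
    for page_no in 0..4u64 {
        file.write_all_at(&[page_no as u8; PAGE_SIZE], page_no * PAGE_SIZE as u64)?;
    }
    let firsts = read_pages_concurrently(Arc::new(file), 4)?;
    println!("{:?}", firsts); // [0, 1, 2, 3]
    std::fs::remove_file(&path)
}
```

With read + lseek the same program would need a mutex around every seek-and-read pair, which serializes the readers and removes the point of using threads.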


Decision 2: pread/pwrite over mmap

This is more nuanced. We use explicit I/O for all labs except where we deliberately study mmap.

| Aspect | mmap | pread/pwrite |
|---|---|---|
| Code complexity | Lower (pointer access) | Higher (explicit calls) |
| Latency predictability | Bad: page faults are synchronous, can stall on cold pages | Good: every cost is visible in the syscall |
| Write durability | Tricky: msync(MS_SYNC) is expensive, is usually issued over the whole mapping, and the kernel may write dirty pages back earlier in any order | Surgical: fdatasync(fd) |
| Memory accounting | Counts as file-backed page cache that the kernel evicts on its own schedule; hard to reason about WSS | Buffers are yours, you bound them |
| Large files (> RAM) | Catastrophic: random page-in storms | Fine: you read what you need |
| Multi-threaded scaling | Page-fault locks scale poorly | Scales with cores until the device saturates |
| TLB pressure | Hugepages help but transparent hugepage transitions can pause processes | None |
| Used by | LMDB, BoltDB | SQLite, LevelDB, RocksDB, Postgres |

The Pavlo et al. CIDR 2022 paper (linked in references.md) is the definitive teardown. TL;DR: mmap is fine when (a) the dataset fits in RAM, (b) the workload is read-heavy, and (c) you don't care about latency tails. For everything else, pread/pwrite wins.


Decision 3: Page Size = 4 KiB

We pick 4 KiB as the default page size in this lab and reconsider in later labs.

| Page size | Pros | Cons |
|---|---|---|
| 512 B | Old HDD sector; small writes are cheap | High metadata-to-data ratio, lots of indirection |
| 4 KiB | Matches OS page, NVMe LBA, page cache. Sweet spot for OLTP. | Small for analytics |
| 8 KiB | Postgres default. Better for slightly larger rows. | Wastes I/O for tiny tuples |
| 16 KiB | MySQL InnoDB default. Good index fanout. | One row update = 16 KiB write |
| 64 KiB / 1 MiB | Analytics, sequential scans, Parquet data pages | Terrible for random updates |

Rule of thumb: page size ≈ the device sector size × small constant. With NVMe at 4 KiB LBA and the OS page also at 4 KiB, going smaller is fighting the hardware and going larger is amortizing a smaller win.
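
Once the page size is fixed, most of a page allocator is offset arithmetic. The helpers below are a sketch with hypothetical names (`page_offset`, `page_of`, `align_up`, `is_aligned`); the alignment helpers assume the alignment is a power of two, which holds for 512 B and 4 KiB.

```rust
const PAGE_SIZE: u64 = 4096;

/// Byte offset at which page `page_no` starts.
fn page_offset(page_no: u64) -> u64 {
    page_no * PAGE_SIZE
}

/// Number of the page that contains byte `offset`.
fn page_of(offset: u64) -> u64 {
    offset / PAGE_SIZE
}

/// Round `n` up to the next multiple of `align` (which must be a power of two).
fn align_up(n: u64, align: u64) -> u64 {
    debug_assert!(align.is_power_of_two());
    (n + align - 1) & !(align - 1)
}

/// True when `n` is a multiple of `align` (which must be a power of two).
fn is_aligned(n: u64, align: u64) -> bool {
    debug_assert!(align.is_power_of_two());
    n & (align - 1) == 0
}

fn main() {
    println!("page 3 starts at byte {}", page_offset(3)); // 12288
    println!("byte 12289 lives in page {}", page_of(12289)); // 3
    println!("a 5000-byte record occupies {} bytes of pages", align_up(5000, PAGE_SIZE)); // 8192
    println!("offset 4100 aligned to 4 KiB? {}", is_aligned(4100, PAGE_SIZE)); // false
}
```

The same `is_aligned` check applies to buffer addresses, offsets, and lengths when a later lab opens files with O_DIRECT.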


Decision 4: Endianness on Disk

Our on-disk format is little-endian. Justified by:

  1. x86 and ARM (in normal mode) are little-endian. Big-endian on these platforms means a byte swap on every read.
  2. Network protocols use big-endian by convention, but our on-disk format is not a network protocol — it's only read by the same machine (or by an explicit migration tool).
  3. LevelDB and RocksDB use little-endian for fixed-width fields, with varints for variable-width. We follow that convention for compatibility of mental model.
  4. SQLite uses big-endian for historical reasons (the project dates to 2000 and the current file format to 2004, when MIPS/PowerPC/SPARC were still common). It's a legitimate alternative; we just optimize for the modern hardware reality.

Always use explicit conversion functions at the I/O boundary. Never memcpy an int to disk and hope. Our Rust code uses u64::to_le_bytes; Go uses encoding/binary.LittleEndian; C++ uses std::endian (C++20) with std::byteswap (which is C++23, so build with -std=c++23 or supply your own swap).


Decision 5: When to call fsync

The cost of fsync on consumer NVMe is roughly:

  • Single-threaded latency: ~50 µs–1 ms depending on outstanding writes
  • Throughput-limited: roughly 5,000–20,000 fsync/sec before contention dominates

The cost on a HDD is 5–15 ms per fsync. That's why ye olde databases did group commit.

The right policy depends on durability requirements:

| Policy | What survives a crash | Throughput cost |
|---|---|---|
| No fsync | Nothing reliably (kernel may flush eventually) | None |
| fsync per write | Every acknowledged write | Massive: one device flush per write |
| fsync per N writes | Up to the last N-1 writes may be lost | 1/N the cost |
| Group commit | Every acknowledged write; latency = time-to-batch + fsync | Excellent: best of both |
| fsync periodically (e.g., every 100 ms) | The last interval of writes may be lost (compare MySQL innodb_flush_log_at_trx_commit=2, which flushes about once per second) | Excellent |

The right design for the WAL is group commit: when a writer finishes appending its record, it waits on a condition variable; the WAL writer thread pwrites the pending records, calls fdatasync once, then wakes every waiter. We'll build this in db-03.

macOS Caveat

On macOS, fsync(fd) does not flush the drive's write cache — it only sends the data to the drive. To get true durability you must call fcntl(fd, F_FULLFSYNC), which can be 10–100× slower than fsync on the same hardware. SQLite, LevelDB, and Postgres all handle this. Our wrapper in src/*/fsync_full.* does the platform dispatch.


Decision 6: O_DIRECT — Not in This Lab

We don't use O_DIRECT in Lab 01 because:

  1. It requires aligned buffers (typically 4 KiB), aligned offsets, and aligned I/O sizes.
  2. It bypasses the page cache, so you must implement your own — which is a buffer manager (Phase 3, db-11).
  3. It's not available on macOS — you'd use fcntl(fd, F_NOCACHE, 1) as the closest analogue, but it has weaker semantics.

We revisit O_DIRECT in db-21-storage-engine-advanced once we have a buffer manager worth talking about.


Hardware Numbers Cheat-Sheet

Memorize these. They drive every storage design decision:

L1 cache hit            1   ns
Branch mispredict       3   ns
L2 cache hit           ~4   ns
L3 cache hit          ~15   ns
DRAM access          ~100   ns       — 100× L1
Context switch       ~1–5  µs
NVMe random read      ~50  µs        — 500× DRAM
NVMe sequential read  ~5   µs/4KB
SATA SSD random read ~100  µs
SATA SSD seq read    ~10   µs/4KB
HDD random read       ~5   ms        — 50,000× DRAM
HDD sequential read  ~30   µs/4KB    (~150 MB/s)
fsync on NVMe       ~50 µs–1 ms
fsync on HDD          ~10 ms
F_FULLFSYNC (macOS)   ~10–50 ms      — actually flushes drive cache
Network RTT same DC   ~500 µs
Network RTT same region ~1 ms
Network RTT cross-region ~50–150 ms  — drives Raft heartbeat tuning in db-17

Gaps of two or more orders of magnitude are where the design changes. Between L1 and DRAM (100×), you can usually ignore it because the caches hide most of it. Between DRAM and disk (about 500× for NVMe, 50,000× for HDD), you can't. Between NVMe and a cross-region network round trip (another 1,000× or more), distributed systems get hard.


What Breaks at Scale

  • Filesystem journal contention: fsync on ext4 serializes through the FS journal. Many concurrent fsyncs on the same FS don't scale linearly. Mitigation: one WAL file per database, dedicated FS for WAL.
  • Page cache thrashing: when working set > RAM, every pread is a miss. The kernel's LRU is generic; an app-aware cache (Phase 2's block cache, Phase 3's pager) does better.
  • fsync failure handling: on Linux, a failed fsync can mark dirty pages as clean — silently losing your data. This is the "fsyncgate" referenced in the references. Mitigation: panic on fsync error and crash-recover from the WAL (modern Postgres does this).
  • NVMe queue depth: NVMe shines with QD=32–128 in flight. A single-threaded pread loop runs at QD=1 and leaves most of the drive idle. io_uring (Phase 5) fixes this.

Execution — Storage Primitives

Prerequisites

You've completed the toolchain setup in ../../TOOLS.md. To confirm:

rustc --version      # ≥ 1.78
go version           # ≥ 1.22
clang++ --version    # ≥ 16  (or g++ ≥ 13)
cmake --version      # ≥ 3.28

Quick Start — All Three Languages

From the lab root:

# Rust
( cd src/rust && cargo build --release )
./src/rust/target/release/pagealloc write /tmp/lab01.bin 0 "hello, disk"
./src/rust/target/release/pagealloc read  /tmp/lab01.bin 0
./src/rust/target/release/pagealloc hexdump /tmp/lab01.bin

# Go
( cd src/go && go build -o /tmp/pagealloc-go ./cmd/pagealloc )
/tmp/pagealloc-go write /tmp/lab01.bin 0 "hello, disk"
/tmp/pagealloc-go read  /tmp/lab01.bin 0
/tmp/pagealloc-go hexdump /tmp/lab01.bin

# C++
( cd src/cpp && cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release && cmake --build build )
./src/cpp/build/pagealloc write /tmp/lab01.bin 0 "hello, disk"
./src/cpp/build/pagealloc read  /tmp/lab01.bin 0
./src/cpp/build/pagealloc hexdump /tmp/lab01.bin

All three binaries produce byte-compatible files: write with one, read with another, get the same bytes.

CLI Reference (all three implementations)

| Command | Effect |
|---|---|
| pagealloc write <file> <page_no> <ascii_string> | Write the ASCII bytes (zero-padded to 4 KiB) into page page_no. Calls fdatasync (fsync where fdatasync is unavailable) before returning. |
| pagealloc read <file> <page_no> | pread page page_no (4 KiB), print bytes up to first null. |
| pagealloc hexdump <file> | Walk the whole file 4 KiB at a time and print a canonical hex dump (16 bytes/line). |
| pagealloc bench <file> <pages> <iters> | Random pread benchmark: file is preallocated to pages pages, then iters random reads are timed. |

Tests

# Rust
( cd src/rust && cargo test )

# Go
( cd src/go && go test ./... )

# C++
( cd src/cpp && cmake --build build && ctest --test-dir build --output-on-failure )

Each test suite covers:

  1. Round-trip: write then read returns the same bytes.
  2. Cross-implementation: a file written by Rust must read correctly with the Go and C++ binaries (run by scripts/cross_test.sh).
  3. fsync or fdatasync is called on write (verified by strace -e trace=fsync,fdatasync in the cross_test script on Linux).
  4. Endianness sanity: the page header uses little-endian and is identical across implementations.

Environment Variables

| Variable | Default | Effect |
|---|---|---|
| DSE_PAGE_SIZE | 4096 | Override page size (must be a power of two). Only consumed by the pagealloc bench subcommand. |
| DSE_FSYNC | 1 | If 0, skip fsync on write. Only for benchmarking; never in production. |

Observation — Storage Primitives

How to look inside the page cache, watch syscalls, measure latency, and prove to yourself that your code is doing what you think.

Looking at the Page Cache

Linux

# What's in the page cache for our file? (Requires `pcstat` or vmtouch.)
go install github.com/tobert/pcstat/pcstat@latest
pcstat /tmp/lab01.bin

# Drop the page cache (requires root) — to test "cold" reads.
sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

macOS

# `purge` drops the unified buffer cache (requires admin password).
sudo purge

# `fs_usage` is the macOS strace-for-files. It filters by process name or PID,
# so start it first, then run pagealloc in another terminal.
sudo fs_usage -w -f filesys pagealloc

Watching Syscalls

# Linux
strace -e trace=openat,pread64,pwrite64,fsync,fdatasync,close \
       /tmp/pagealloc-go write /tmp/lab01.bin 0 "hello"

# macOS  (sudo required for dtrace)
sudo dtruss -f ./src/cpp/build/pagealloc write /tmp/lab01.bin 0 "hello" 2>&1 | grep -E 'pread|pwrite|fsync|fcntl'

You should see — in order:

openat(AT_FDCWD, "/tmp/lab01.bin", O_RDWR|O_CREAT, 0644) = 3
pwrite64(3, "hello\0\0...", 4096, 0)                    = 4096
fdatasync(3)                                            = 0
close(3)                                                = 0

If you see read(3, ...) or write(3, ...) without an offset, you're using the cursor-based calls, which is wrong for this lab. If you see no fsync/fdatasync (or, on macOS, no fcntl carrying F_FULLFSYNC), your durability is fake.

Measuring Latency

The bench subcommand measures cold-cache and warm-cache pread latency:

# Preallocate a 100 MB file, then do 10000 random 4 KiB reads.
./src/rust/target/release/pagealloc bench /tmp/lab01.bin 25600 10000

Expected output:

preallocated: 25600 pages = 102400 KiB
warm-cache reads:   p50=3.1 µs   p99=8.4 µs   throughput=315 MB/s
dropped page cache
cold-cache reads:   p50=78 µs    p99=210 µs   throughput=51 MB/s

The exact numbers depend on your hardware. The shape matters:

  • Warm p50 ≈ 1–5 µs: that's a memcpy from the page cache. No actual disk I/O.
  • Cold p50 ≈ 50–200 µs on NVMe, 5–15 ms on a spinning disk.
  • p99 is several times p50 (about 3× in the sample above, and worse under load): latency tails are real; this motivates io_uring and dedicated I/O threads.

Profiling Tools

Rust

cargo install cargo-flamegraph
cd src/rust
cargo flamegraph --release --bin pagealloc -- bench /tmp/lab01.bin 25600 100000
# open flamegraph.svg in your browser

Go

cd src/go
go test -bench=BenchmarkPread -cpuprofile=cpu.prof .   # profiling flags need a single package, not ./...
go tool pprof -http=:8080 cpu.prof

C++

# Linux
perf record -F 999 -g ./src/cpp/build/pagealloc bench /tmp/lab01.bin 25600 100000
perf report

# macOS (use Instruments.app or sample)
sample pagealloc 5 -file /tmp/sample.txt

Watching Disk Throughput

# Linux  (iostat from sysstat package)
iostat -dx nvme0n1 1

# macOS
sudo fs_usage -w -f diskio

While running pagealloc bench, watch r/s (reads per second), rkB/s, and r_await (average read latency in ms; older sysstat versions report a combined await). For NVMe at QD=1, expect r/s to plateau in the low tens of thousands (one read every 50–100 µs); you'd need io_uring (Lab 21) to push it into the hundreds of thousands.

Verifying Endianness

# Create a fresh one-page file.
./src/rust/target/release/pagealloc write /tmp/endian.bin 0 ""
# In whichever language you prefer, overwrite the first 4 bytes of the file with the
# u32 0x01020304, using that language's explicit little-endian conversion.
# Then xxd the file:
xxd /tmp/endian.bin | head -1

A little-endian encoding stores 04 03 02 01 for the value 0x01020304. If you see 01 02 03 04, either your machine is big-endian (unlikely on x86/ARM) and you wrote the integer without an explicit conversion, or your code is using to_be_bytes somewhere.
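One way to carry out the "write a binary u32" step is sketched below in Rust. The helper name write_u32_le is hypothetical (it is not part of pagealloc), and writing at byte offset 0 is an assumption made so that the value shows up on the first xxd line.

```rust
use std::fs::OpenOptions;
use std::io;
use std::os::unix::fs::FileExt;
use std::path::Path;

/// Hypothetical helper (not part of pagealloc): store `value` at byte offset 0
/// of `path` in explicit little-endian order, then read the raw bytes back.
pub fn write_u32_le(path: &Path, value: u32) -> io::Result<[u8; 4]> {
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open(path)?;
    // to_le_bytes fixes the byte order regardless of the host CPU.
    file.write_all_at(&value.to_le_bytes(), 0)?;
    file.sync_data()?;

    // Read back what actually landed in the file.
    let mut raw = [0u8; 4];
    file.read_exact_at(&mut raw, 0)?;
    Ok(raw)
}
```

Calling write_u32_le on /tmp/endian.bin with 0x01020304 and then running xxd should show 04 03 02 01 as the first four bytes.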

Verification — Storage Primitives

The pass/fail checks for this lab. If all eight pass for all three implementations, you are done.

Per-Implementation Checks

For each of src/rust, src/go, src/cpp:

V1 — Builds

# Rust
( cd src/rust && cargo build --release ) && echo "RUST OK"
# Go
( cd src/go && go build ./... ) && echo "GO OK"
# C++
( cd src/cpp && cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release && cmake --build build ) && echo "CPP OK"

V2 — Unit Tests Pass

( cd src/rust && cargo test --release ) && echo "RUST TESTS OK"
( cd src/go   && go test ./... )         && echo "GO TESTS OK"
( cd src/cpp/build && ctest --output-on-failure ) && echo "CPP TESTS OK"

V3 — Round-Trip

# Per binary:
$BIN write /tmp/v3.bin 5 "hello, lab"
$BIN read  /tmp/v3.bin 5 | grep -q "^hello, lab$" && echo "V3 OK"

V4 — fsync Is Called (Linux only)

strace -e fsync,fdatasync -o /tmp/syscalls.log $BIN write /tmp/v4.bin 0 "x"
grep -E 'fsync|fdatasync' /tmp/syscalls.log && echo "V4 OK"

(Expected: at least one of fsync(...) or fdatasync(...) in the trace. On macOS the implementations call fcntl(fd, F_FULLFSYNC) instead, so substitute sudo dtruss -t fcntl.)

Cross-Implementation Checks

V5 — Byte-Compatibility

Files written by one implementation must read identically with the others.

RUST=./src/rust/target/release/pagealloc
GO=/tmp/pagealloc-go          # the path used by go build -o in Quick Start
CPP=./src/cpp/build/pagealloc

$RUST write /tmp/v5.bin 3 "cross-lang ok"
$GO   read  /tmp/v5.bin 3 | grep -q "^cross-lang ok$" && echo "GO read RUST OK"
$CPP  read  /tmp/v5.bin 3 | grep -q "^cross-lang ok$" && echo "CPP read RUST OK"

$GO   write /tmp/v5.bin 7 "go writes"
$CPP  read  /tmp/v5.bin 7 | grep -q "^go writes$" && echo "CPP read GO OK"
$RUST read  /tmp/v5.bin 7 | grep -q "^go writes$" && echo "RUST read GO OK"

V6 — Hexdump Identical

$RUST hexdump /tmp/v5.bin > /tmp/v6.rust.hex
$GO   hexdump /tmp/v5.bin > /tmp/v6.go.hex
$CPP  hexdump /tmp/v5.bin > /tmp/v6.cpp.hex
diff /tmp/v6.rust.hex /tmp/v6.go.hex && diff /tmp/v6.rust.hex /tmp/v6.cpp.hex && echo "V6 OK"

V7 — Endianness Sanity

The first 8 bytes of each non-empty page should be a little-endian magic constant 0x44534531_50414745 (DSE1PAGE reversed):

# Page 0 of /tmp/v5.bin was never written, so inspect page 3 (offset 3 × 4096 = 0x3000).
xxd -s $((3 * 4096)) -l 8 /tmp/v5.bin
# Expected: 00003000: 4547 4150 3145 5344

If you see 4453 4531 5041 4745, your implementation is writing big-endian — fix that.

V8 — Benchmark Smoke

$RUST bench /tmp/v8.bin 1024 1000
# Expected: prints both warm-cache and cold-cache p50/p99 lines without crashing.

Master Script

A single command to run everything (provided as scripts/verify.sh):

bash scripts/verify.sh

Expected output ends with:

====================================================
ALL 8 CHECKS PASSED for RUST, GO, CPP
====================================================

If any check fails, the script exits non-zero and prints which check + which implementation failed.

Broader Ideas — Storage Primitives

Where to go after this lab if you want to push deeper. Each idea is a self-contained extension or alternative.

1. Replace pread with io_uring (Linux)

The single biggest jump from this lab's design to a modern engine is moving from synchronous syscalls to async submission queues. With pread at QD=1, NVMe runs at ~5% of its IOPS. With io_uring at QD=32+, it hits the spec sheet.

  • Lab pointer: db-21-storage-engine-advanced does this end-to-end.
  • Self-study: implement a pread_async API now that internally still uses pread but queues requests through a crossbeam channel (Rust) / goroutine pool (Go) / std::jthread worker pool (C++). When you then swap the backend for io_uring, no API consumer changes.
  • Reference: Jens Axboe's "Efficient IO with io_uring" (https://kernel.dk/io_uring.pdf), §3.

2. Page Layout — Slotted Pages vs Fixed-Size Records

Our pages are zero-padded ASCII. Real engines use slotted pages:

┌────────┬────────────────────────────┬──────┐
│ header │ slot[0] slot[1] ...        │ free │
│        │ → offsets into page        │      │
├────────┴────────────────────────────┴──────┤
│ ← record N ← record 1 ← record 0           │  (grows from end)
└────────────────────────────────────────────┘

This lets variable-length records share a page without external fragmentation. PostgreSQL, MySQL InnoDB, and SQLite all use slotted pages. Try this: extend pagealloc so each page holds a slot directory and stores up to 16 variable-length keys per page. This is the warm-up for db-10.
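The layout in the diagram can be sketched in a few dozen lines of Rust. Everything here is an assumption made for illustration: the 4-byte header (slot count and free-space end), the 4-byte slot format (offset and length), and the names SlottedPage, insert, and get. It leaves out the lab's 16-byte magic header, deletion, and compaction.

```rust
const PAGE_SIZE: usize = 4096;
const HEADER: usize = 4; // slot_count: u16, free_end: u16 (both little-endian)
const SLOT: usize = 4; // record offset: u16, record length: u16

pub struct SlottedPage {
    buf: [u8; PAGE_SIZE],
}

impl SlottedPage {
    pub fn new() -> Self {
        let mut page = SlottedPage { buf: [0u8; PAGE_SIZE] };
        page.put_u16(2, PAGE_SIZE as u16); // records grow down from the page end
        page
    }

    fn get_u16(&self, at: usize) -> usize {
        u16::from_le_bytes([self.buf[at], self.buf[at + 1]]) as usize
    }

    fn put_u16(&mut self, at: usize, v: u16) {
        self.buf[at..at + 2].copy_from_slice(&v.to_le_bytes());
    }

    /// Append a record; returns its slot index, or None if it does not fit.
    pub fn insert(&mut self, record: &[u8]) -> Option<usize> {
        let count = self.get_u16(0);
        let free_end = self.get_u16(2);
        // Where the slot directory would end after adding one more slot.
        let slots_end = HEADER + (count + 1) * SLOT;
        if slots_end + record.len() > free_end {
            return None; // directory and record area would collide
        }
        let start = free_end - record.len();
        self.buf[start..free_end].copy_from_slice(record);
        let slot_at = HEADER + count * SLOT;
        self.put_u16(slot_at, start as u16);
        self.put_u16(slot_at + 2, record.len() as u16);
        self.put_u16(0, (count + 1) as u16);
        self.put_u16(2, start as u16);
        Some(count)
    }

    /// Look a record up by slot index.
    pub fn get(&self, slot: usize) -> Option<&[u8]> {
        if slot >= self.get_u16(0) {
            return None;
        }
        let slot_at = HEADER + slot * SLOT;
        let start = self.get_u16(slot_at);
        let len = self.get_u16(slot_at + 2);
        Some(&self.buf[start..start + len])
    }
}
```

Because callers hold slot indexes rather than byte offsets, a later compaction pass could move records inside the page without invalidating those references.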

3. Copy-on-Write Pages (LMDB-style)

Instead of overwriting a page in place, allocate a fresh page and update the parent to point to it. This is how LMDB achieves single-writer MVCC without a WAL. Pros: simpler crash recovery (just point at the last committed root). Cons: unreferenced pages must be reclaimed, and every update rewrites the whole path from leaf to root, which raises write amplification.

  • Reference: Howard Chu's LMDB tech docs, http://www.lmdb.tech/doc/
  • Self-study: extend the allocator to track free pages and never overwrite; introduce a "commit" op that just writes a new root pointer atomically.

4. Write Coalescing & Group Commit

Right now every write calls fsync immediately. As soon as several writers are in flight at once, group commit lets them share one fsync:

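use std::fs::File;
use std::os::unix::fs::FileExt;
use std::sync::mpsc::{Receiver, RecvTimeoutError, Sender};
use std::time::Duration;

// Sketch, not the db-03 implementation: each request carries its own completion channel.
struct WriteReq { offset: u64, data: Vec<u8>, done: Sender<()> }

fn group_commit_loop(file: &File, rx: Receiver<WriteReq>) -> std::io::Result<()> {
    let mut pending: Vec<WriteReq> = Vec::new();
    loop {
        // Collect requests until 64 are queued or none arrives for 100 µs.
        match rx.recv_timeout(Duration::from_micros(100)) {
            Ok(req) => { pending.push(req); if pending.len() < 64 { continue; } }
            Err(RecvTimeoutError::Timeout) if pending.is_empty() => continue,
            Err(RecvTimeoutError::Disconnected) if pending.is_empty() => return Ok(()),
            Err(_) => {} // timeout or shutdown with work queued: flush it
        }
        for req in &pending { file.write_all_at(&req.data, req.offset)?; }
        file.sync_data()?; // one fdatasync covers the whole batch
        for req in pending.drain(..) { let _ = req.done.send(()); }
    }
}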
  • Lab pointer: db-03-write-ahead-log builds this for the WAL. Try it here as warm-up.
  • Trade-off: latency increases by up to ~100 µs per write; throughput can rise by ~50× under contention.

5. Direct I/O + Aligned Buffers

O_DIRECT (Linux) or F_NOCACHE (macOS) bypasses the page cache. To use it you need 4-KiB-aligned buffers (in Rust: Layout::from_size_align(4096, 4096)?; in C++: posix_memalign(&buf, 4096, 4096); in Go: trickier — use golang.org/x/sys/unix.Mmap with MAP_ANON).

  • When this matters: when your app has a better cache than the kernel (e.g., Phase 2's block cache). Oracle, MySQL with O_DIRECT, and most flash-tuned engines pick this.
  • Self-study: add a pagealloc write-direct subcommand that opens with O_DIRECT and demonstrates the alignment requirement (the program must fail predictably if the buffer is unaligned).
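The Rust half of the alignment requirement can be sketched with the standard allocator API. AlignedBuf is a hypothetical name, and the sketch stops at producing a correctly aligned buffer: actually opening a file with O_DIRECT needs a platform-specific flag that is left out here.

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

/// A zeroed heap buffer whose address and length are multiples of `align`.
pub struct AlignedBuf {
    ptr: *mut u8,
    layout: Layout,
}

impl AlignedBuf {
    /// Returns None if the size is zero, is not a multiple of `align`,
    /// or `align` is not a power of two.
    pub fn new(size: usize, align: usize) -> Option<Self> {
        if size == 0 || align == 0 || size % align != 0 {
            return None;
        }
        let layout = Layout::from_size_align(size, align).ok()?;
        // SAFETY: the layout has a non-zero size, checked above.
        let ptr = unsafe { alloc_zeroed(layout) };
        if ptr.is_null() {
            None
        } else {
            Some(AlignedBuf { ptr, layout })
        }
    }

    pub fn as_mut_slice(&mut self) -> &mut [u8] {
        // SAFETY: ptr is valid for layout.size() bytes and uniquely owned by self.
        unsafe { std::slice::from_raw_parts_mut(self.ptr, self.layout.size()) }
    }

    /// Address of the first byte, for checking alignment.
    pub fn addr(&self) -> usize {
        self.ptr as usize
    }
}

impl Drop for AlignedBuf {
    fn drop(&mut self) {
        // SAFETY: ptr was allocated with exactly this layout.
        unsafe { dealloc(self.ptr, self.layout) }
    }
}
```

A Vec<u8> gives no such guarantee, which is why a plain vector handed to an O_DIRECT file descriptor typically fails with EINVAL.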

6. Sparse Files & Hole Punching

Files don't have to be contiguous. fallocate with FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE releases blocks back to the filesystem without changing the file size. Useful for LSM-tree SSTable compaction (free space after removing dead keys) and for journal log truncation.

  • Reference: man 2 fallocate
  • Self-study: add pagealloc punch <file> <page_no> and verify that the file's apparent size (ls -l <file>) is unchanged while its on-disk size (du -h <file>) shrinks.

7. Crash Testing with dm-flakey (Linux)

The hard part of storage code is testing the failure cases. dm-flakey is a Linux device-mapper target that makes a device misbehave on a fixed schedule: it works normally for a while, then fails or silently drops I/O for a while.

# 5-second window of normal operation, then 1 second of dropping writes, repeat.
size=$(sudo blockdev --getsz /dev/loop0)   # device length in 512-byte sectors
sudo dmsetup create flakey-dev --table "0 $size flakey /dev/loop0 0 5 1 1 drop_writes"

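Mount your test filesystem on /dev/mapper/flakey-dev and run your pagealloc write loop across the drop window, then unmount and remount so you read from the device rather than the page cache. Writes that were never fsynced may be gone. Writes whose fsync returned during an up window should survive; writes issued during the drop window were discarded by the device even though the calls succeeded. Fault injection like this is how the real engines test durability.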

8. Comparing mmap Yourself

We argued for pread/pwrite. Don't take our word for it — implement pagealloc-mmap as a fourth implementation. Compare:

| Workload | pread | mmap |
|---|---|---|
| Sequential read of 1 GB | ? | ? |
| Random read of 4 KiB × 10⁶ from a 1 GB file (warm) | ? | ? |
| Random read from a 100 GB file (cold) | ? | ? |
| 10⁵ random writes with durability | ? | ? |

Plot the results and write down what surprised you. Bring those numbers to the Crotty, Leis, and Pavlo mmap paper (in references.md) and check whether they match.

9. Persistent Memory (PMEM, Optane)

Intel Optane is dead, but Persistent Memory programming patterns survive in CXL.mem and in research kernels. PMEM is byte-addressable like RAM, persistent like SSD, with clwb + sfence as the durability primitive (no fsync). The persistent memory programming library (PMDK) is what to read.

  • Reference: https://pmem.io/pmdk/
  • Why it matters: if/when CXL persistent memory becomes commodity, every storage engine in this curriculum will need rethinking. SplitFS and uTree are research designs that already assume PMEM.

10. Beyond Disk: Object Storage as a Backing Store

Modern cloud-native databases (Snowflake, Databricks, BigQuery) don't pwrite to local disks; they write multi-megabyte objects to an object store such as S3. The trade-offs are wildly different (high per-request latency, near-unlimited aggregate throughput, and, on S3 until December 2020, only eventual consistency for overwrites). The closest "primitives lab" for that world would replace pread/pwrite with HTTP range requests. Worth thinking about, especially before db-23's capstone.

  • Reference: "Lakehouse: A New Generation of Open Platforms" (Armbrust et al., CIDR 2021)

Step 1 — Open a File and Write Bytes

Goal

Build the smallest possible thing that touches the disk: open a file, write some bytes at a known offset, close the file. You'll do this three times — once in Rust, Go, and C++ — so you can feel how each language exposes the same pread/pwrite/fsync primitives.

Prerequisites

  • Toolchain installed per ../../TOOLS.md.
  • An empty editor and a terminal in this lab's directory.

What You're Building

A function with this signature (conceptually):

write_page(path: string, page_no: u64, bytes: [u8]) -> Result
  • Opens (or creates) path for read+write.
  • Computes offset = page_no * PAGE_SIZE (with PAGE_SIZE = 4096).
  • Zero-pads bytes to exactly PAGE_SIZE.
  • pwrites the padded buffer at offset.
  • Calls fdatasync (or fsync if fdatasync is unavailable).
  • Closes the file.
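The steps above can be sketched in Rust roughly as follows. This is a minimal sketch under stated assumptions, not the lab's actual lib.rs: it writes the raw payload with no page header (the header layout is introduced in Step 2), and the error messages are made up for illustration.

```rust
use std::fs::OpenOptions;
use std::io;
use std::os::unix::fs::FileExt;
use std::path::Path;

pub const PAGE_SIZE: usize = 4096;

/// Write `bytes`, zero-padded to one page, at page `page_no` of `path`.
pub fn write_page(path: &Path, page_no: u64, bytes: &[u8]) -> io::Result<()> {
    if bytes.len() > PAGE_SIZE {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "payload exceeds one page",
        ));
    }
    // Open (or create) for read+write; never truncate existing pages.
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open(path)?;
    let offset = page_no.checked_mul(PAGE_SIZE as u64).ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidInput, "page number overflows offset")
    })?;

    // Zero-pad to exactly PAGE_SIZE so the kernel never does read-modify-write.
    let mut page = [0u8; PAGE_SIZE];
    page[..bytes.len()].copy_from_slice(bytes);

    file.write_all_at(&page, offset)?; // positional write, no seek pointer involved
    file.sync_data()?; // durability barrier before we report success
    Ok(())
    // `file` is closed when it goes out of scope here.
}
```

Writing page 2 of a fresh file leaves pages 0 and 1 as a hole that reads back as zeros, which is why the file length after one call is (page_no + 1) × PAGE_SIZE.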

Why pwrite, not write

The classic POSIX write syscall writes at the file's current seek position (the one lseek moves). That makes it stateful: two threads calling write on the same fd race on that shared position. pwrite takes an explicit offset and never touches the seek position, so concurrent calls don't interfere. Every database in this curriculum uses pwrite. No lseek in our code, ever.

Why PAGE_SIZE = 4096

It matches the OS page size on x86_64 and on most ARM64 Linux systems (Apple Silicon macOS uses 16 KiB pages), which means the kernel page cache, the device LBA, and your write are usually the same unit. Mismatched sizes cause read-modify-write at the kernel layer: writing 100 bytes into a page that isn't cached requires the kernel to first read the 4 KiB page containing those bytes, modify it, and write it back. By always writing a full page, you avoid that hidden cost.

Why fdatasync Over fsync

fsync flushes data and all metadata (including modification time). fdatasync flushes the data plus only the metadata needed to read it back, such as a changed file size. For a write that doesn't change the file size (the common case in a steady-state database) fdatasync can skip the metadata update, which often saves an extra journal write per call. Use fdatasync when you can.

Rust Implementation

In ../src/rust/src/lib.rs we use the std::os::unix::fs::FileExt extension trait: write_all_at loops over write_at, which issues pwrite64 on Linux and pwrite on macOS. Look at the function write_page.

Key idiom:

use std::os::unix::fs::FileExt;
file.write_all_at(&buf, offset)?;
file.sync_data()?;   // == fdatasync

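sync_data is Rust's portable name for fdatasync on Linux. On macOS the standard library has historically issued fcntl(F_FULLFSYNC) here; F_BARRIERFSYNC is a faster middle ground that orders writes without a full cache flush, so check the std source for your toolchain if the difference matters to you.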

Go Implementation

In ../src/go/pagealloc.go, the WriteAt method is pwrite, and f.Sync() is fsync. There is no first-class fdatasync in os, so we call unix.Fdatasync(fd) from golang.org/x/sys/unix.

if _, err := f.WriteAt(buf, offset); err != nil { return err }
return unix.Fdatasync(int(f.Fd()))

On macOS, unix.Fdatasync is not exported (the kernel doesn't have it). We fall back to unix.FcntlInt(fd, unix.F_FULLFSYNC, 0). The wrapper in fsync_full.go handles the platform branch.

C++ Implementation

In ../src/cpp/src/pagealloc.cc:

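// pwrite may write fewer bytes than asked; this snippet treats that as an error.
ssize_t n = ::pwrite(fd, buf.data(), buf.size(), offset);
if (n != static_cast<ssize_t>(buf.size())) return std::errc::io_error;
// Never ignore the result: a failed fdatasync means the page may not be durable.
if (::fdatasync(fd) != 0) return std::errc::io_error;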

On macOS we use ::fcntl(fd, F_FULLFSYNC). The dispatch is in fsync_full.cc.

Try It

cd src/rust && cargo build --release
./target/release/pagealloc write /tmp/step1.bin 0 "first page"
xxd -l 32 /tmp/step1.bin

Expected output:

00000000: 4547 4150 3145 5344 0100 0000 0a00 0000  EGAP1ESD........
00000010: 6669 7273 7420 7061 6765 0000 0000 0000  first page......

The first 8 bytes are our little-endian page magic 0x44534531_50414745 (read as bytes left-to-right: 45 47 41 50 31 45 53 44). Bytes 8–15 hold the rest of the header: version 1, flags 0, and payload_len 10 (the layout is spelled out in Step 2). Bytes 16+ contain your ASCII payload "first page" followed by zero-padding to 4 KiB.

What Just Happened

  1. You opened a file (open(2) with O_RDWR | O_CREAT).
  2. You wrote exactly one page at exactly one offset (pwrite(2)).
  3. You forced the data to stable storage (fdatasync(2) or F_FULLFSYNC on macOS).
  4. You closed the fd, which does not flush — close(2) returns immediately.

On a power loss between step 3 and step 4, your write survives. Without step 3, it might not.

Next

In Step 2 you'll add the read path and a hexdump utility, and verify that all three implementations produce byte-identical files.

Step 2 — pread and Hexdump

Goal

Implement the read side (pread) and a hexdump utility, then prove cross-implementation byte-compatibility: a file written by Rust must read identically from Go and C++.

The Read Side

Symmetric to Step 1:

read_page(path: string, page_no: u64) -> [u8; PAGE_SIZE]
  • Open the file read-only.
  • pread(fd, buf, PAGE_SIZE, page_no * PAGE_SIZE).
  • Return the buffer (caller will strip trailing zeros or use the magic header to validate).
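A minimal Rust sketch of the read path described above. It assumes one design choice the text leaves open: a short final page (a file that was not preallocated) is returned zero-filled rather than reported as an error.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;
use std::path::Path;

pub const PAGE_SIZE: usize = 4096;

/// Read page `page_no` of `path`. Bytes past end-of-file are left as zeros.
pub fn read_page(path: &Path, page_no: u64) -> io::Result<[u8; PAGE_SIZE]> {
    let file = File::open(path)?; // read-only
    let base = page_no.checked_mul(PAGE_SIZE as u64).ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidInput, "page number overflows offset")
    })?;

    let mut page = [0u8; PAGE_SIZE];
    let mut filled = 0;
    while filled < PAGE_SIZE {
        // read_at (pread) may return fewer bytes than requested; 0 means end of file.
        match file.read_at(&mut page[filled..], base + filled as u64) {
            Ok(0) => break,
            Ok(n) => filled += n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(page)
}
```

If you would rather treat a short page as corruption, read_exact_at does that for you by returning an UnexpectedEof error.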

Rust

use std::os::unix::fs::FileExt;
file.read_exact_at(&mut buf, offset)?;

Go

n, err := f.ReadAt(buf, offset)
if err != nil && err != io.EOF { return nil, err }

ReadAt returns io.EOF if n < len(buf) — this is normal for the last page of a file that hasn't been preallocated. Tests handle this case.

C++

ssize_t n = ::pread(fd, buf.data(), buf.size(), offset);
if (n < 0) return std::errc::io_error;
buf.resize(n);   // shrink if short read

Page Header Format

Every non-empty page in our format begins with a 16-byte header:

offset  size  field
------  ----  -----
   0     8    magic = 0x44534531_50414745  (LE: 45 47 41 50 31 45 53 44 ; ASCII reversed: "EGAP1ESD")
   8     2    version = 1
  10     2    flags = 0
  12     4    payload_len  (LE u32, number of bytes used after the header)
  16     n    payload bytes
n+16     —    zero-pad to PAGE_SIZE

This is a deliberately simple format — we'll grow it in later labs. For now it gives us:

  • A magic number to detect "is this a valid page?"
  • A version field to evolve the format later.
  • An explicit payload length so we don't have to scan for zeros (zeros are valid bytes in real data).
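The header table translates almost line for line into code. The sketch below is one possible encoding in Rust; the function names encode_page and decode_page are hypothetical, and rejecting any version other than 1 is an assumption about how strict validation should be.

```rust
pub const PAGE_SIZE: usize = 4096;
pub const HEADER_LEN: usize = 16;
pub const MAGIC: u64 = 0x4453_4531_5041_4745; // "DSE1PAGE" when read big-endian

/// Build a full page: 16-byte header, payload, zero padding.
pub fn encode_page(payload: &[u8]) -> Option<[u8; PAGE_SIZE]> {
    if payload.len() > PAGE_SIZE - HEADER_LEN {
        return None;
    }
    let mut page = [0u8; PAGE_SIZE];
    page[0..8].copy_from_slice(&MAGIC.to_le_bytes()); // magic
    page[8..10].copy_from_slice(&1u16.to_le_bytes()); // version = 1
    page[10..12].copy_from_slice(&0u16.to_le_bytes()); // flags = 0
    page[12..16].copy_from_slice(&(payload.len() as u32).to_le_bytes()); // payload_len
    page[HEADER_LEN..HEADER_LEN + payload.len()].copy_from_slice(payload);
    Some(page)
}

/// Validate the header and return the payload slice, or None if the page is not ours.
pub fn decode_page(page: &[u8; PAGE_SIZE]) -> Option<&[u8]> {
    let mut magic = [0u8; 8];
    magic.copy_from_slice(&page[0..8]);
    let version = u16::from_le_bytes([page[8], page[9]]);
    let len = u32::from_le_bytes([page[12], page[13], page[14], page[15]]) as usize;
    if u64::from_le_bytes(magic) != MAGIC || version != 1 || len > PAGE_SIZE - HEADER_LEN {
        return None;
    }
    Some(&page[HEADER_LEN..HEADER_LEN + len])
}
```

An all-zero page fails the magic check, which is how a reader can tell a never-written page from one holding an empty payload.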

The Hexdump Utility

A canonical 16-byte-per-line hex dump:

00000000: 4547 4150 3145 5344 0100 0000 0a00 0000  EGAP1ESD........
00000010: 6669 7273 7420 7061 6765 0000 0000 0000  first page......
00000020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
...

Format spec:

  • 8-digit hex offset, then : .
  • 16 bytes per line, grouped 2 bytes per word, separated by single space.
  • 2-space gap before the ASCII rendering.
  • ASCII rendering: printable ASCII as itself, otherwise ..

This format matches xxd output for easy diff-based cross-language verification.
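The format spec is small enough to sketch directly. The Rust below is an illustration rather than the lab's implementation; hexdump_line and hexdump are hypothetical names, and padding a short final line with spaces is an assumption (pages in this lab are always a multiple of 16 bytes, so that branch should never trigger).

```rust
/// Format up to 16 bytes as one hex-dump line (no trailing newline).
pub fn hexdump_line(offset: usize, chunk: &[u8]) -> String {
    let mut line = format!("{:08x}:", offset);
    for i in 0..16 {
        if i % 2 == 0 {
            line.push(' '); // one space before each 2-byte group
        }
        match chunk.get(i) {
            Some(b) => line.push_str(&format!("{:02x}", b)),
            None => line.push_str("  "), // keep columns aligned on a short line
        }
    }
    line.push_str("  "); // 2-space gap before the ASCII rendering
    for &b in chunk.iter().take(16) {
        line.push(if (0x20..=0x7e).contains(&b) { b as char } else { '.' });
    }
    line
}

/// Dump a whole buffer, 16 bytes per line.
pub fn hexdump(data: &[u8]) -> String {
    data.chunks(16)
        .enumerate()
        .map(|(i, chunk)| hexdump_line(i * 16, chunk))
        .collect::<Vec<_>>()
        .join("\n")
}
```

Feeding it the 16 header bytes of a "first page" page reproduces the first line of the sample dump above.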

Cross-Implementation Test

This is the most important check in this lab. Run scripts/cross_test.sh:

bash scripts/cross_test.sh

What it does (excerpt):

$RUST write /tmp/xt.bin 0 "from rust"
$GO   write /tmp/xt.bin 1 "from go"
$CPP  write /tmp/xt.bin 2 "from cpp"

$RUST hexdump /tmp/xt.bin > /tmp/h.rust
$GO   hexdump /tmp/xt.bin > /tmp/h.go
$CPP  hexdump /tmp/xt.bin > /tmp/h.cpp

diff /tmp/h.rust /tmp/h.go || { echo "RUST/GO mismatch"; exit 1; }
diff /tmp/h.rust /tmp/h.cpp || { echo "RUST/CPP mismatch"; exit 1; }
echo "cross-language byte-compat OK"

If this fails, the most common bugs are:

  1. Wrong endianness on the magic or payload_len.
  2. Forgetting to zero-pad the page (one impl leaves junk past the payload).
  3. Off-by-one on the offset calculation (page_no * PAGE_SIZE vs (page_no + 1) * PAGE_SIZE).

What Just Happened

You now have a portable, file-format-compatible storage primitive across three languages. This is the foundation for every later lab — the WAL in db-03 is exactly this with append-only semantics, and the SSTable in db-06 is this with a richer block format.

Next

Step 3: measure latency, demonstrate the page cache, and understand why your second read of a page is about 50× faster than the first on NVMe (and thousands of times faster on a spinning disk).

Step 3 — Benchmark and the Page Cache

Goal

See the page cache with your own eyes by measuring warm-cache and cold-cache pread latency. This is the experiment that should make you suspicious of every microbenchmark you ever read.

The Benchmark

pagealloc bench <file> <pages> <iters>:

  1. Preallocate file to pages * 4 KiB using a sequential write loop.
  2. fsync to make sure it's on disk.
  3. Time iters random preads of one page each.
  4. Drop the page cache.
  5. Time iters random preads again.
  6. Print p50/p99/p99.9 for each phase plus throughput in MB/s.
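Steps 3, 5, and 6 share one timing loop and one percentile calculation, sketched below in Rust. This is an illustration under stated assumptions, not the lab's bench code: the xorshift generator, its fixed seed, and the nearest-rank percentile definition are all choices made here, and preallocation and cache dropping are left out.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;
use std::time::{Duration, Instant};

const PAGE_SIZE: usize = 4096;

/// Nearest-rank percentile over an ascending-sorted, non-empty slice; `pct` in (0, 100].
pub fn percentile(sorted: &[Duration], pct: f64) -> Duration {
    assert!(!sorted.is_empty(), "no samples");
    let rank = (pct * sorted.len() as f64 / 100.0).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

/// Time `iters` random single-page reads; returns per-read latencies, sorted ascending.
pub fn time_random_reads(file: &File, pages: u64, iters: usize) -> io::Result<Vec<Duration>> {
    if pages == 0 {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "file has no pages"));
    }
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15; // fixed seed so runs are repeatable
    let mut buf = [0u8; PAGE_SIZE];
    let mut samples = Vec::with_capacity(iters);
    for _ in 0..iters {
        // xorshift64: a tiny PRNG, random enough to defeat kernel read-ahead.
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        let page_no = state % pages;

        let start = Instant::now();
        file.read_exact_at(&mut buf, page_no * PAGE_SIZE as u64)?;
        samples.push(start.elapsed());
    }
    samples.sort();
    Ok(samples)
}
```

Run it once right after preallocating (warm) and once after dropping the page cache (cold), then report percentile(&samples, 50.0), 99.0, and 99.9 for each phase.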

Implementation lives in:

Dropping the Page Cache

On Linux:

sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

Our benchmark binary calls this automatically if it can. If it can't (no sudo), it prints a warning and skips the cold phase.

On macOS:

sudo purge

Same logic — the binary attempts it, warns if it can't.

Expected Numbers

On a modern laptop with NVMe:

$ ./pagealloc bench /tmp/bench.bin 65536 50000   # 256 MB file, 50k iters
preallocated 65536 pages = 262144 KiB

WARM cache:
  iterations : 50000
  p50        : 1.2 µs
  p99        : 5.8 µs
  p99.9      : 18 µs
  throughput : 1840 MB/s

dropped page cache

COLD cache:
  iterations : 50000
  p50        : 64 µs
  p99        : 180 µs
  p99.9      : 340 µs
  throughput : 56 MB/s

Two observations:

  1. Warm cache is ~50× faster than cold. The page cache makes microbenchmarks lie. If you benchmarked a database after running the benchmark for warmup, you measured memcpy, not disk.
  2. p99 is roughly 3× p50 on cold cache (and about 5× on warm). Latency tails come from queue depth, kernel scheduling, and NVMe garbage collection. This motivates io_uring (Lab 21) and request hedging in distributed systems (Lab 20).

On a Spinning Disk (if you have one)

COLD cache:
  p50        : 6.4 ms     ← 100× worse than NVMe
  p99        : 18 ms
  throughput : 0.6 MB/s   ← versus 56 MB/s on NVMe

This 100× gap is why LSM-trees exist. Random reads on HDD are unworkable for OLTP; engines either:

  • Avoid them (sequential append-only logs).
  • Hide them behind cache (large block caches + bloom filters).
  • Punt to SSD (HDD as cold tier only).

Throughput vs Latency

Watch what happens with iters = 1000 vs iters = 100000:

$ ./pagealloc bench /tmp/bench.bin 65536 1000
WARM throughput : 4200 MB/s

$ ./pagealloc bench /tmp/bench.bin 65536 100000
WARM throughput : 1800 MB/s

Higher iteration counts touch more distinct pages of the 256 MB file, so the working set outgrows the CPU caches and TLB reach, exposing memory bandwidth and TLB misses. Real workloads sit between these. A single benchmark number is almost always wrong.

Try This

Add a flag to control the access pattern: sequential vs random. Sequential preads benefit from the kernel's read-ahead heuristic. On the same NVMe device you should see:

RANDOM     cold : 56 MB/s
SEQUENTIAL cold : 2400 MB/s    ← 40× faster, all due to read-ahead

This is why scans are cheap and point lookups are expensive — even on SSD.

What Just Happened

You measured the page cache, the access pattern's effect on throughput, and the gap between p50 and p99. These three insights drive every storage design in this curriculum:

  • Page cache exists → your in-process block cache (Lab 8) must be smarter than LRU on raw bytes, otherwise you're duplicating the kernel's work.
  • Sequential >> random → LSM-tree compaction (Lab 7) keeps data sorted on disk so that range reads become sequential.
  • p99 >> p50 → consensus heartbeats (Lab 17) must tolerate occasional 100× slow fsyncs without triggering elections.

Next

You've finished Lab 01. Run the checks in docs/verification.md to confirm all 8 pass. Then move on to db-02-data-structures-for-storage.

Data Structures for Storage

Status: Complete. Companion to db-01 and prerequisite for db-05 (MemTable) and db-10 (B-Tree).

1. What Is It

This lab is about the in-memory data structures that databases use, and why those choices change completely when the data is on disk. We build two structures from scratch:

  • A skip list — an ordered, probabilistic, pointer-based map. This is what LevelDB and RocksDB use for their MemTable, and what Redis uses for sorted sets.
  • A hash table with open-addressing + Robin Hood probing — an unordered, array-backed map. This is what you use when you need O(1) point lookups and don't need ordering.

We then benchmark them against each other on three workloads (insert, point lookup, range scan) and explain why the numbers come out the way they do.

This lab does not implement a B-Tree or a B+-Tree — those are on-disk structures and arrive in db-10. The lesson here is why a B-Tree dominates on disk even though a skip list or hash table is faster in RAM.

2. Why It Matters

Every database has a critical-path data structure for "find this key":

| System | Structure | Why |
|---|---|---|
| LevelDB / RocksDB MemTable | Skip list | Lock-free reads, ordered iteration for flush to SSTable |
| Redis sorted-set (ZSET) | Skip list + hash table | Skip list for ranked access, hash for O(1) by-key |
| Memcached, Java HashMap | Hash table (separate chaining) | Unordered, point lookup only |
| SQLite, PostgreSQL, InnoDB | B+-Tree | On-disk: minimize page reads |
| Cassandra, ScyllaDB MemTable | Skip list | Same reasoning as LevelDB |
| Lucene postings | Skip list + delta-encoded arrays | Range scans over sorted doc IDs |

If you pick the wrong structure you don't get "a little slower" — you get "100× slower" or "we run out of memory at 100M keys." The cost model for an in-memory structure (cycles, cache misses) is not the cost model for an on-disk one (page reads, sync latency), and a structure tuned for one will lose badly in the other domain.

3. How It Works

Skip list

A skip list is a stack of linked lists. The bottom list contains every key in sorted order. Each higher list contains a random subset of the keys below it, sampled with probability p (we use p = 0.5). To find a key, you walk right on the highest level until you'd overshoot, then drop down a level, repeat.

level 3:  HEAD ────────────────────────────────────────────────►  NIL
              │                                                  │
level 2:  HEAD ────────► [13] ───────────────► [42] ────────────► NIL
              │           │                     │                │
level 1:  HEAD ──► [7] ─► [13] ─► [21] ───────► [42] ─► [55] ──► NIL
              │     │     │       │             │       │       │
level 0:  HEAD ► [3]►[7]►[13]►[21]►[34]►[39]►[42]►[51]►[55] ──► NIL

Searching for 39 from the top:

  1. L3: HEAD → NIL (overshoot from HEAD), drop to L2
  2. L2: HEAD → 13 → 42? 42 > 39, drop to L1
  3. L1: 13 → 21 → 42? 42 > 39, drop to L0
  4. L0: 21 → 34 → 39. Found.

Expected cost: O(log n). The expected number of comparisons is about (1/p) · log_{1/p}(n); with p = 0.5 that is 2 · log₂ n.
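The walk above can be reproduced on a hand-built copy of the diagram. This is a minimal sketch, not the lab's implementation: it stores nodes in a Vec and uses indices instead of pointers, and the names `build_example`, `search`, and `NIL` are hypothetical.

```rust
const NIL: usize = usize::MAX;

struct Node {
    key: u32,            // unused for the head sentinel at index 0
    forward: Vec<usize>, // forward[i] = index of the next node on level i
}

/// Builds the nine-key list from the diagram; index 0 is the head sentinel.
fn build_example() -> Vec<Node> {
    let keys_and_heights: [(u32, usize); 9] = [
        (3, 1), (7, 2), (13, 3), (21, 2), (34, 1), (39, 1), (42, 3), (51, 1), (55, 2),
    ];
    let mut nodes = vec![Node { key: 0, forward: vec![NIL; 4] }];
    for &(key, height) in &keys_and_heights {
        nodes.push(Node { key, forward: vec![NIL; height] });
    }
    // Link every level left to right; last[lvl] is the latest node that reaches lvl.
    let mut last = [0usize; 4];
    for idx in 1..nodes.len() {
        for lvl in 0..nodes[idx].forward.len() {
            nodes[last[lvl]].forward[lvl] = idx;
            last[lvl] = idx;
        }
    }
    nodes
}

/// Returns (found, keys of the nodes the search stepped onto, in order).
fn search(nodes: &[Node], target: u32) -> (bool, Vec<u32>) {
    let mut x = 0; // start at the head
    let mut visited = Vec::new();
    for lvl in (0..nodes[0].forward.len()).rev() {
        loop {
            let next = nodes[x].forward[lvl];
            if next == NIL || nodes[next].key >= target {
                break; // moving right would overshoot: drop a level
            }
            x = next;
            visited.push(nodes[x].key);
        }
    }
    let candidate = nodes[x].forward[0];
    (candidate != NIL && nodes[candidate].key == target, visited)
}
```

Calling `search(&build_example(), 39)` steps onto 13, 21, and 34 before landing on 39 at level 0, which matches the four-step trace.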

Hash table (open addressing, Robin Hood)

An array of 2^k slots. Hash the key and mask it down to the table size (hash & (2^k − 1)); that is the home slot. If it is occupied by a different key, probe linearly (slot+1, slot+2, ...). Robin Hood twist: if the key being inserted has already travelled d slots from its home and the resident of the current slot sits only d' < d slots from its own home, swap them, so the "rich" entry is displaced by the "poor" one and continues probing. This keeps probe distances tightly clustered around the mean; it narrows the variance but does not change the O(n) worst case.

home(K1)=2   home(K2)=2   home(K3)=4
  hash before insert:
  ┌──┬──┬────┬────┬────┬──┬──┬──┐
  │  │  │ K1 │ K2 │ K3 │  │  │  │
  └──┴──┴────┴────┴────┴──┴──┴──┘
   0  1   2    3    4   5  6  7
  (K1 home=2 dist=0, K2 home=2 dist=1, K3 home=4 dist=0)

  insert K4 with home=2:
   probe slot 2 (K1, dist 0); K4 dist=0; equal — keep going
   probe slot 3 (K2, dist 1); K4 dist=1; equal — keep going
   probe slot 4 (K3, dist 0); K4 dist=2 > 0 — STEAL, K3 displaced
   probe slot 5 (empty); place K3 with dist=1
  result:
  ┌──┬──┬────┬────┬────┬────┬──┬──┐
  │  │  │ K1 │ K2 │ K4 │ K3 │  │  │
  └──┴──┴────┴────┴────┴────┴──┴──┘

Loads up to ~0.9 work well with Robin Hood; we resize at 0.85 (× 2 capacity, rehash all).
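The K1..K4 walk-through can be replayed in a few lines. The sketch below takes each key's home slot as an argument instead of hashing, so the layout is forced to match the diagram; `Entry`, `robin_hood_insert`, and `names` are hypothetical names, and duplicate keys and resizing are deliberately left out.

```rust
#[derive(Clone, Debug, PartialEq)]
struct Entry {
    name: &'static str,
    psl: usize, // distance from the entry's home slot
}

/// Inserts `name` starting at `home`, stealing from any resident that is
/// closer to its own home than the entry currently being carried.
fn robin_hood_insert(slots: &mut [Option<Entry>], name: &'static str, home: usize) {
    let cap = slots.len();
    let mut incoming = Entry { name, psl: 0 };
    let mut i = home % cap;
    loop {
        if let Some(resident) = slots[i].as_mut() {
            if resident.psl < incoming.psl {
                // The "rich" resident is evicted; we now carry it forward.
                std::mem::swap(resident, &mut incoming);
            }
        } else {
            slots[i] = Some(incoming);
            return;
        }
        i = (i + 1) % cap;
        incoming.psl += 1;
    }
}

/// Slot contents by name, with "" for empty slots.
fn names(slots: &[Option<Entry>]) -> Vec<&'static str> {
    slots.iter().map(|s| s.as_ref().map_or("", |e| e.name)).collect()
}
```

Inserting K1 (home 2), K2 (home 2), K3 (home 4), then K4 (home 2) into 8 empty slots leaves K4 in slot 4 with distance 2 and K3 pushed to slot 5 with distance 1, as in the result row above.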

When in-memory and on-disk diverge

A skip-list node holding a 16-byte key + 16-byte value + 4 forward pointers is ~64 bytes in memory. The same record packed into a B+-Tree leaf page (4 KiB page, no per-record pointers) is ~36 bytes — no level header, no forward-pointer array. And the B+-Tree co-locates ~100 records in one page, so a range scan of 100 keys is 1 page read instead of 100 random reads.

On disk:

  • A pointer is an 8-byte offset that triggers a page read (~100 µs cold).
  • A page-cache miss is ~1000× a hit (~100 µs from SSD vs ~100 ns from RAM).
  • A page is 4 KiB whether you read 1 byte or 4096.

Therefore on-disk structures want high fan-out, low height, contiguous siblings, no random pointer chasing. Skip lists violate all four; B+-Trees satisfy all four.
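To put numbers on "high fan-out, low height", the sketch below counts how many levels a structure needs to reach n keys at a given fan-out and multiplies by a per-page read cost. The function names are hypothetical, and the model assumes one cold page read per level, which is the pessimistic case described above.

```rust
/// Smallest h such that fanout^h >= n, using integer arithmetic only.
fn levels(n: u64, fanout: u64) -> u32 {
    assert!(fanout >= 2);
    let mut reach: u128 = 1;
    let mut h = 0;
    while reach < n as u128 {
        reach *= fanout as u128;
        h += 1;
    }
    h
}

/// Cold lookup cost if every level costs one page read.
fn cold_lookup_micros(n: u64, fanout: u64, page_read_micros: u64) -> u64 {
    levels(n, fanout) as u64 * page_read_micros
}
```

For 2^24 keys (about 16 M) at 100 µs per page read, fan-out 2 needs 24 levels and 2400 µs, while fan-out 256 needs 3 levels and 300 µs.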

4. Core Terminology

| Term | Definition |
| --- | --- |
| Skip list | Probabilistic ordered map: stack of linked lists with geometric level distribution |
| Level | Index into a skip-list node's forward-pointer array (0 = bottom, dense; higher = sparser) |
| p | Per-level promotion probability (we use 0.5) |
| Sentinel head | Dummy node with the maximum possible level; all searches start here |
| Open addressing | Collision resolution: probe other slots in the same array (vs chaining into a list) |
| Linear probing | Probe sequence is h, h+1, h+2, … (best cache behavior) |
| Robin Hood | On insert, displace any resident whose probe distance is smaller than the newcomer's |
| Probe distance / PSL | Slots between a key's home slot and its actual slot (probe sequence length) |
| Load factor | len / capacity |
| Tombstone | Sentinel marking a deleted slot so probes don't short-circuit (we use backward-shift deletion instead) |
| Backward-shift deletion | After deleting a slot, shift the following non-home entries left by one; avoids tombstones |
| Cache line | 64 bytes on x86_64, 128 bytes on Apple Silicon; the unit the CPU fetches from RAM |
| Pointer chasing | A traversal whose next address depends on the byte just loaded; CPU cannot prefetch |
| Fan-out | Number of children per internal node in a tree; B+-Trees aim for hundreds |

5. Mental Models

"A skip list is a binary search you can mutate cheaply"

A balanced BST and a skip list have the same asymptotic complexity. The skip list wins because it has no rebalancing: no rotations, no recoloring. Each insert is one geometric coin flip + N forward-pointer writes (N = node height, 1/(1−p) = 2 on average at p = 0.5). This makes it much easier to make concurrent: LevelDB's MemTable allows lock-free reads while a single writer inserts, because a partially built node is invisible until the writer publishes the bottom-level pointer with one atomic store.

"A hash table is a sparse array you pretend is dense"

If keys were integers in [0, N) you'd use an array. A hash function fakes that: it maps any key into [0, capacity). Collisions are the cost of the lie. Robin Hood specifically equalizes the cost of the lie across keys, so the worst-case lookup is close to the average.

"Cost models differ by 5 orders of magnitude"

L1 cache hit                        ~1 ns
Main memory                       ~100 ns
NVMe SSD random read (cold)       ~100 µs   ← 1000× RAM
HDD seek                           ~10 ms   ← 100× SSD
Cross-DC round trip                ~50 ms

A "fast" in-memory structure becomes irrelevant if it issues 10 page reads where a B+-Tree issues 1. The B+-Tree's "slow" O(log_B n) with B = 256 beats the skip list's "fast" O(log₂ n) on disk by a factor of log₂(256) = 8 — every level you avoid saves 100 µs.

6. Common Misconceptions

  • "Skip lists are slow because they're probabilistic." No — the expected and with-high-probability bounds are tight. The variance is small for any list above ~1000 keys. Failure modes are bad RNG seed (we use a deterministic xorshift here) and adversarial key insertion patterns (irrelevant for hashed keys; mitigated by per-instance seed).
  • "Hash tables have O(1) worst case." Average, not worst. A pathological hash function or adversarial keys produce O(n) chains. Robin Hood mitigates variance but does not change the worst case.
  • "You should always use a hash table when you don't need ordering." Two cases where skip lists or trees win even unordered: (a) you need iteration in any deterministic order across runs; (b) memory is tight and you can't afford the 15–40% slack of a hash table at safe load factors.
  • "Open addressing wastes memory because of empty slots." Chaining wastes more in practice: every chain node is a heap allocation with a header (malloc arenas + pointer + next ptr ≈ 32 bytes overhead per entry). Linear-probing hash tables with 70% load factor still use less memory than std::unordered_map.
  • "A B-Tree is just a balanced BST with more children." No: B+-Trees keep all data in leaves and chain leaves with sibling pointers, making range scans O(1) per page after the initial descent.
  • "std::unordered_map / Go map / Python dict are the gold standard." They're general-purpose. Specialized hash tables (Abseil's flat_hash_map, Rust's hashbrown, F14) beat them by 2–5× on most workloads. Database authors often roll their own.

7. Interview Talking Points

  • "Why does LevelDB use a skip list for the MemTable instead of a red-black tree?" → Lockless reads via single-pointer CAS publish; no rotations; easier to implement correctly.
  • "Why isn't a hash table good enough for a MemTable?" → MemTable flushes to an SSTable, which is a sorted file. A hash table would require sorting at flush time (O(n log n)); a skip list is already sorted, so flush is O(n) sequential.
  • "When would you use chaining vs open addressing?" → Open addressing for small fixed-size values (better cache); chaining when values are large and you want pointer stability across resizes.
  • "What's the cost model on disk that breaks skip lists?" → Each level traversal is a potential page read. With log₂ n levels you pay log₂ n × 100 µs. A B+-Tree with fan-out 256 has log_256 n levels, so 3 page reads for 16 M keys vs 24.
  • "Why is Robin Hood probing useful?" → Bounds variance: the maximum probe distance grows as O(log n) w.h.p., and lookups for missing keys become almost as fast as hits because you can stop as soon as you see a slot with smaller PSL than yours.
  • "What's the alternative to tombstones?" → Backward-shift deletion: walk forward from the deleted slot, shift each non-home entry left until you hit an empty slot or a home-slotted entry. O(probe-length) per delete, no tombstone bookkeeping.

8. Connections to Other Labs

  • db-01 Storage Primitives — the page abstraction the disk structures use.
  • db-03 WAL — the WAL is appended before the MemTable insert; failure recovery rebuilds the MemTable by replaying the WAL.
  • db-04 Bloom Filters — sits in front of the SSTable; same family of probabilistic in-memory structures.
  • db-05 LSM MemTable — uses the skip list from this lab, adds a snapshot / immutable flip.
  • db-10 B-Tree Fundamentals — contrast with this lab; same problem, different cost model.
  • db-21 Storage Engine Advanced — concurrent skip list (CAS publish), concurrent hash table (extendible hashing, lock striping).

References — db-02 Data Structures for Storage

Skip lists

  • Pugh, W. (1990). "Skip Lists: A Probabilistic Alternative to Balanced Trees." CACM 33(6). The original paper; six pages, very readable.
    https://www.cs.umd.edu/~pugh/galileo/papers/CACM_Skiplist_1990.pdf
  • LevelDB MemTable source — skip list with a single allocator arena.
    https://github.com/google/leveldb/blob/main/db/skiplist.h
  • RocksDB InlineSkipList — production skip list with per-node tail allocation.
    https://github.com/facebook/rocksdb/blob/main/memtable/inlineskiplist.h
  • Redis t_zset.c — skip list with per-node span field for O(log n) rank queries.
    https://github.com/redis/redis/blob/unstable/src/t_zset.c

Hash tables

  • Celis, P., Larson, P.-Å., Munro, J. I. (1985). "Robin Hood Hashing." FOCS '85. Original paper.
  • Pedro Celis's thesis on Robin Hood hashing (1986). The probe-distance analysis is here.
    https://cs.uwaterloo.ca/research/tr/1986/CS-86-14.pdf
  • Emmanuel Goossaert's deep dive — accessible and runnable.
    https://codecapsule.com/2013/11/11/robin-hood-hashing/
  • Google Abseil flat_hash_map — SIMD probing on top of open addressing.
    https://abseil.io/about/design/swisstables
  • Rust hashbrown — port of Abseil's SwissTable; the implementation behind std::collections::HashMap.
    https://github.com/rust-lang/hashbrown

Cost models & cache

  • Drepper, U. (2007). "What Every Programmer Should Know About Memory."
    https://akkadia.org/drepper/cpumemory.pdf
  • Jeff Dean's "Numbers Every Programmer Should Know" — the canonical latency hierarchy.
    https://gist.github.com/jboner/2841832
  • Pavlo, A. "Database Storage I" (CMU 15-445). The disk-vs-RAM cost-model lecture.
    https://15445.courses.cs.cmu.edu/fall2023/slides/03-storage1.pdf

Tree structures (preview for db-10)

  • Bayer, R., McCreight, E. (1972). "Organization and Maintenance of Large Ordered Indices." Acta Informatica. The original B-Tree paper.
  • Comer, D. (1979). "The Ubiquitous B-Tree." ACM Computing Surveys. The canonical survey.

Background

  • Sedgewick & Wayne, Algorithms 4th ed.
  • Cormen, Leiserson, Rivest, Stein, Introduction to Algorithms (CLRS) 3rd ed. — Ch. 11 (hash tables), Ch. 17 (amortized analysis).

Analysis — Design Decisions

Every choice here is reversible in code but irreversible in performance: change one, all the others bend with it.

D1. Skip list over balanced BST (red-black, AVL)

| | Skip list | Red-black tree |
| --- | --- | --- |
| Insert / lookup | O(log n) expected | O(log n) worst case |
| Implementation LOC | ~120 (Rust) | ~400+ |
| Concurrent reads | Lock-free with seqlock or single-CAS publish | Requires hand-over-hand locking |
| Cache locality | Poor (pointer chasing) | Poor (pointer chasing) |
| Worst-case bound | w.h.p., not absolute | Absolute |

Choice: skip list. The simplicity (no rotations, no rebalancing, no parent pointers) is the value proposition. An absolute worst-case bound matters little here: node heights come from our own RNG, not from the keys, so no insertion order can force a degenerate shape.

When you'd flip: real-time systems where a probabilistic O(n) blowup is unacceptable. Database MemTables don't qualify.

D2. Hash table: open addressing + Robin Hood, not chaining

| | Open addressing (linear) | Chaining |
| --- | --- | --- |
| Memory per entry | (1/load factor) × sizeof(slot) | sizeof(entry) + sizeof(ptr) + malloc overhead |
| Cache misses per lookup | 0–1 typically | 1–3 typically |
| Pointer stability across resize | No | Yes |
| Deletion | Backward-shift or tombstone | Free a list node |
| Tail latency at high load | Spikes near 1.0 | Degrades gracefully |

Choice: open addressing, linear probing, Robin Hood. The cache-miss saving is decisive at small/medium values, which is the database use case.

When you'd flip: large values (≥1 KiB) where you want pointer stability so external references survive resize.

D3. p = 0.5 for skip list, not p = 0.25

The expected number of comparisons per search is (1/p) · log_{1/p}(n). Minimizing this gives p = 1/e ≈ 0.37. In practice:

| p | Expected comparisons (n = 1M) | Levels | Forward pointers per node (avg) |
| --- | --- | --- | --- |
| 0.25 | ~40 | ~10 | 1.33 |
| 0.5 | ~40 | ~20 | 2 |
| 1/e | ~38 | ~14 | 1.58 |

We pick p = 0.5 because (a) bit-shift sampling (count trailing zeros of a random u64) is one instruction, and (b) the code stays trivial. The ~6% theoretical improvement from p = 1/e is not worth the table lookup or log math. p = 0.25 matches p = 0.5 on expected comparisons with fewer pointers per node, at the price of more variance in search time.
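The trade-off can be checked by evaluating the formula quoted at the top of this section directly. This is a small sketch with hypothetical function names; it computes the (1/p) · log_{1/p}(n) comparison estimate, the expected level count log_{1/p}(n), and the average pointers per node 1/(1−p).

```rust
/// Expected comparisons per search: (1/p) * log_{1/p}(n).
fn expected_comparisons(p: f64, n: f64) -> f64 {
    (1.0 / p) * expected_levels(p, n)
}

/// Expected number of levels in use: log_{1/p}(n).
fn expected_levels(p: f64, n: f64) -> f64 {
    n.ln() / (1.0 / p).ln()
}

/// Average forward pointers per node: 1 / (1 - p).
fn avg_forward_pointers(p: f64) -> f64 {
    1.0 / (1.0 - p)
}
```

At n = 1,000,000 the estimate is the same for p = 0.5 and p = 0.25 (about 39.9 comparisons) and only slightly lower at p = 1/e (about 37.6), so the choice of p mostly trades pointer memory against level count and variance rather than search cost.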

D4. Max level = 32

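A skip list with p = 0.5 and n entries has expected max level ≈ log₂ n. At max level 32 we support n = 2^32 ≈ 4 G entries before search quality degrades. Because each node's forward array is sized to its own height, going higher only costs extra slots in the head sentinel and the per-insert update array (8 bytes per extra level each), but it buys nothing below 2^32 entries. Going lower caps the size at which searches stay O(log n).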

D5. Hash function: FNV-1a 64-bit + SplitMix64 finalizer

We need a hash that is:

  • High quality enough for non-adversarial keys (passes basic distribution tests)
  • Fast for small keys (~10 cycles per 8 bytes)
  • Identical in all three languages so cross-language tests can compare counts

FNV-1a is 6 lines of code and deterministic, but on its own it avalanches poorly on short sequential keys: run raw, it inflated max_probe well past the expected bound (Step 2 has the measurement). A SplitMix64 finalizer on top restores the expected probe-distance distribution for a few extra instructions. That is good enough because we control the input keys in this lab; in production you'd switch to xxHash3 or SipHash.

When you'd flip: keys controlled by adversaries → SipHash with a per-instance random key.

D6. Load factor 0.85, grow ×2

Robin Hood handles load factor up to ~0.9 well; beyond that the variance of probe distance blows up. We resize at 0.85 to leave headroom and double the capacity (and rehash). Cost of resize is amortized O(1) per insert.
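The amortized O(1) claim is easy to sanity-check with counters alone. The sketch below does not build a table; it only replays the growth policy (resize when the next insert would push load past 0.85, double, rehash everything) and counts how many entries get rehashed in total. `simulate_growth` is a hypothetical helper, and the 0.85 check mirrors the insert pseudocode in Step 2.

```rust
/// Replays n inserts under "double when (len + 1) / cap > 0.85".
/// Returns (final capacity, total entries rehashed across all resizes).
fn simulate_growth(n: u64, initial_cap: u64) -> (u64, u64) {
    let mut cap = initial_cap;
    let mut len = 0u64;
    let mut rehashed = 0u64;
    for _ in 0..n {
        if (len + 1) as f64 / cap as f64 > 0.85 {
            rehashed += len; // every live entry moves to the new array
            cap *= 2;
        }
        len += 1;
    }
    (cap, rehashed)
}
```

Starting from 16 slots, one million inserts end at capacity 2,097,152 with fewer than two rehash moves per insert in total, which is the geometric-series argument behind the amortized bound.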

D7. Backward-shift deletion, no tombstones

Tombstones simplify code but bloat the table over time — a "delete-then-insert" workload fills the table with markers and forces a resize. Backward-shift deletion costs O(PSL) per delete (typically <5 slots) and keeps the table compact. The implementation walks forward from the deleted slot, moves each entry one slot left until reaching an empty slot or an entry whose PSL is 0.

D8. Seed: deterministic per-instance, random across instances

The skip list RNG must not be deterministically the same across runs. But within one process run we want reproducible behavior for debugging. The CLI accepts an optional seed; tests pass a fixed value.

Cost-model cheat sheet

| Operation | Latency |
| --- | --- |
| L1 cache hit | ~1 ns |
| L2 cache hit | ~4 ns |
| L3 cache hit | ~15 ns |
| Main memory | ~100 ns |
| 4 KiB page from SSD (cold) | ~100 µs |
| 4 KiB page from HDD | ~10 ms |

Every random pointer dereference is a potential L3 miss → 100 ns. A skip-list search over 20 levels follows at least one such pointer per level, so about 20 misses ≈ 2 µs when nothing is cached. A linear-probing hash table touches 1–2 cache lines, which is 100–200 ns under the same assumption. That puts the hash table roughly 10–20× ahead for point lookups in RAM, in line with the Step 3 measurements, and the gap grows as the working set leaves L3.

What breaks at scale

| Symptom | Cause | Mitigation |
| --- | --- | --- |
| Skip-list lookup tail latency 10× worse than median | Bad RNG sequence; node height variance | Use a higher-quality PRNG; bound max height |
| Hash table tail latency spikes near 90% load | Robin Hood variance explodes | Resize earlier (load factor 0.75) |
| Skip-list memory 2× of equivalent BST | Forward-pointer array overhead | Use per-level arena allocators; pack pointer arrays |
| Hash table grows but never shrinks | Resize is unidirectional in our impl | Shrink when load < 0.25 (we don't; extension point) |
| Iterator skips entries | Mutation during iteration | Snapshot at iterator-creation time (db-05 covers this) |

Execution — How to Build and Run

Quick start (per language)

# Rust
cd src/rust
cargo build --release
cargo test --release
./target/release/dsbench --help

# Go
cd src/go
go test ./...
go build -o bin/dsbench ./cmd/dsbench
./bin/dsbench --help

# C++
cd src/cpp
cmake -S . -B build && cmake --build build
ctest --test-dir build
./build/dsbench --help

CLI: dsbench

A single binary per language that exercises both data structures.

| Subcommand | Args | What it does |
| --- | --- | --- |
| skiplist insert N [seed] | N (count) | Inserts N keys, prints final size + max level + histogram |
| skiplist roundtrip N | N | Inserts N keys, verifies every key reads back, then removes them |
| skiplist iter N | N | Inserts N random keys, prints all keys in iterator order |
| hashtable insert N | N | Inserts N keys, reports load factor + max probe distance + histogram |
| hashtable roundtrip N | N | Insert + verify + delete + verify gone |
| bench point N | N | Inserts N keys into both, benchmarks point lookups for each |
| bench mem N | N | Reports bytes-per-entry for both structures |

Library API

Same shape in all three languages.

SkipList::new(seed)            -> SkipList
SkipList::insert(key, value)   -> bool     (true if newly inserted, false if replaced)
SkipList::get(key)             -> Option<value>
SkipList::remove(key)          -> bool
SkipList::len()                -> usize
SkipList::iter()               -> sorted iterator over (key, value)

HashTable::new(capacity)       -> HashTable
HashTable::insert(key, value)  -> bool
HashTable::get(key)            -> Option<value>
HashTable::remove(key)         -> bool
HashTable::len()               -> usize
HashTable::load_factor()       -> f64
HashTable::max_probe()         -> usize

Keys and values are byte strings.

Verifying

./scripts/verify.sh        # invariants per structure
./scripts/cross_test.sh    # cross-language behavioral checks

Observation — Looking Inside the Structures

What you should see

This lab is at its best when you stop trusting the numbers and start looking at the memory layout.

1. Histogram of skip-list node heights

The skip list with p=0.5 should produce a geometric distribution: ~half the nodes stop at level 0, ~a quarter top out at level 1, and so on.

./target/release/dsbench skiplist insert 100000

Sample output:

level   count    %
    0  50032   50.0   ████████████████████████████████████████████████
    1  25021   25.0   █████████████████████████
    2  12508   12.5   █████████████
    3   6234    6.2   ██████
    4   3098    3.1   ███
    5   1581    1.6   ██
    6    778    0.8   █
    ...
   max level used = 16

If your distribution is skewed (e.g., level 0 is 25% instead of 50%) your RNG or sampling code is wrong. The most common bug is rng() & 1 evaluated once and reused.
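If you want to check the sampler in isolation before wiring it into the list, a standalone histogram is enough. The sketch below assumes the xorshift64 step and trailing-zeros sampling described in Step 1; `level_histogram` is a hypothetical helper, and index 0 of the result corresponds to "level 0" in the output above (a node of height 1).

```rust
const MAX_LEVEL: usize = 32;

/// One xorshift64 step; the state must never be zero.
fn xorshift64(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

/// Node height in 1..=MAX_LEVEL, geometric with p = 0.5.
fn random_level(state: &mut u64) -> usize {
    // A fresh random word per call: reusing one word is the bug named above.
    let height = xorshift64(state).trailing_zeros() as usize + 1;
    height.min(MAX_LEVEL)
}

/// Counts how many of n sampled nodes top out at each level index.
fn level_histogram(n: usize, seed: u64) -> [usize; MAX_LEVEL] {
    assert!(seed != 0, "xorshift64 stays at zero forever");
    let mut state = seed;
    let mut hist = [0usize; MAX_LEVEL];
    for _ in 0..n {
        hist[random_level(&mut state) - 1] += 1;
    }
    hist
}
```

With 100,000 samples, roughly half should land in index 0 and roughly a quarter in index 1; a result far from that points at the sampler rather than the list.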

2. Hash-table probe-distance histogram

./target/release/dsbench hashtable insert 1000000

Sample output (Robin Hood, load 0.477):

probe distance   count       %
            0   633412     63.3
            1   235108     23.5
            2    87412      8.7
            3    32104      3.2
            4    10001      1.0
            5     1652      0.2
            6      298      0.0
            7       12      0.0
            8        1      0.0
   mean = 0.55   max = 8   capacity = 2097152   load = 0.477

With pure linear probing (no Robin Hood) the tail extends much further.

3. Memory accounting

./target/release/dsbench bench mem 1000000

Sample:

skip list  : 1,000,000 entries, ~80 B/entry
hash table : 1,000,000 entries, ~25 B/entry  (cap=2097152, load=0.477)

What "working" looks like

  • Skip list: max level grows like log₂ n + 5 (slight overshoot from variance).
  • Hash table: mean probe < 1, max probe < 10 at load ≤ 0.85.
  • Bench: hash table 5–20× faster than skip list on point ops.
  • Memory: hash table is 2–4× smaller per entry.

What "broken" looks like

  • Mean probe distance climbs above 2 → poor hash function or table not actually power-of-two-sized.
  • Max skip-list level stuck at 1–2 with 1M entries → RNG broken; bit-test always falls through.
  • Same level distribution from one run to the next → seed not random.
  • Hash table size doesn't grow after 85% load → resize trigger not firing.
  • max_probe 3–4× above the theoretical bound → almost always the hash function. We hit this with raw FNV-1a 64-bit (max_probe ≈ 200 at N=100k vs expected ≤ 66). Adding a SplitMix64 finalizer fixed it. The pure-FNV variant typoed its prime constant too — see step 02 for the canonical value 0x00000100000001b3 and the three pinned vectors ("", "a", "foobar").

What scripts/cross_test.sh proves

Runs dsbench skiplist iter N seed in Go, Rust, and C++ and diffs all three outputs. If they aren't byte-identical, one of the three has drifted on hash function, RNG, or ordering — usually the easiest single signal for catching a port regression.

Verification — Invariants

scripts/verify.sh runs the language-default binary (Rust by default) through these checks. scripts/cross_test.sh then re-runs the same scenarios in Go and C++ and asserts the behaviorally observable outputs match. The internal layouts are not required to match — only the API behavior.

| # | Invariant | How verified |
| --- | --- | --- |
| V1 | Skip list round-trip: insert(k, v) then get(k) == v for all k | dsbench skiplist roundtrip 10000 |
| V2 | Skip list iteration is in sorted order | dsbench skiplist iter 1000 piped to sort -c |
| V3 | Skip list level distribution is geometric (p=0.5 ± tolerance) | histogram chi-square check in unit test |
| V4 | Skip list max level stays ≤ MAX_LEVEL even with 100k inserts | bench reports max_level_used |
| V5 | Hash table round-trip: insert(k, v) then get(k) == v for all k | dsbench hashtable roundtrip 10000 |
| V6 | Hash table max probe distance ≤ 4 × log₂(cap × load) at load ≤ 0.85 | unit test asserts |
| V7 | Hash table resizes at load 0.85, capacity doubles, max-probe drops | unit test |
| V8 | Backward-shift deletion never leaves a lookup hole | unit test: insert 100, delete random 50, assert remaining 50 still found |
| V9 | Insert with same key twice replaces value, len() unchanged | unit test |
| V10 | Cross-language: insert sequence [5, 1, 3, 8, 2] into skip list, iter output is sorted in all 3 langs | cross_test.sh |
| V11 | Cross-language: hash table after inserting same 1000 keys reports same len() | cross_test.sh |
| V12 | Cross-language: roundtrip of the same N keys works in all 3 langs | cross_test.sh |

Running

./scripts/verify.sh          # ~5s, runs Rust binary
DSE_BIN=./src/go/bin/dsbench ./scripts/verify.sh
DSE_BIN=./src/cpp/build/dsbench ./scripts/verify.sh

./scripts/cross_test.sh      # builds all 3, runs cross-language checks

What to do when a check fails

| Failure | Most likely cause |
| --- | --- |
| V1, V5 | Off-by-one in insert path; key not normalized |
| V2 | Skip list level 0 chain has out-of-order pointer write |
| V3 | RNG broken: same bit pattern reused per call |
| V4 | Level cap not enforced |
| V6 | Hash table not actually power-of-two; home_slot = hash % cap not masking |
| V7 | Load-factor check uses > instead of >=, or len not decremented on delete |
| V8 | Tombstones left in array; backward-shift loop terminates too early |
| V11 | Hash function differs across langs; it must be FNV-1a + the SplitMix64 finalizer in all three |

Broader Ideas — Where to Go Next

Extensions you can build on top of this lab. Each is a 0.5–2 day exercise.

1. Concurrent skip list with lock-free reads

LevelDB's MemTable allows concurrent readers while a single writer inserts. The trick: nodes are made visible by a single atomic store of the bottom-level forward pointer — once that store lands, the node exists; before, it doesn't. Higher-level pointers can race because they're only used to speed up a search; if they point to a not-yet-visible node, the next compare won't match and the search retries.

Implement: an AtomicPtr per forward pointer, a single writer (enforced by external mutex or compare_exchange), no per-node lock. Test: spawn 8 readers + 1 writer, run for 10s, assert no reader observes a partial node.
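The publish step is easier to get right on a single level first. The sketch below is a stand-in, not a skip list: a one-level linked list where a single writer fully initializes each node and then makes it reachable with one Release store, while readers traverse with Acquire loads. All names are hypothetical, and nodes are intentionally leaked so readers never see freed memory; a real implementation needs an arena or epoch-based reclamation.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};
use std::sync::Arc;
use std::thread;

struct Node {
    key: u64,
    check: u64, // always !key once the node is fully built
    next: AtomicPtr<Node>,
}

struct PublishList {
    head: AtomicPtr<Node>,
}

impl PublishList {
    fn new() -> Self {
        PublishList { head: AtomicPtr::new(ptr::null_mut()) }
    }

    /// Single writer only: build the node completely, then publish it.
    fn push_front(&self, key: u64) {
        let next = self.head.load(Ordering::Relaxed);
        let node = Box::into_raw(Box::new(Node { key, check: !key, next: AtomicPtr::new(next) }));
        self.head.store(node, Ordering::Release); // the node exists from here on
    }

    /// Returns (nodes seen, true if every node seen was fully initialized).
    fn scan(&self) -> (usize, bool) {
        let (mut seen, mut ok) = (0, true);
        let mut cur = self.head.load(Ordering::Acquire);
        while !cur.is_null() {
            // Safety: nodes are never freed in this sketch, so `cur` is valid.
            let node = unsafe { &*cur };
            ok &= node.check == !node.key;
            seen += 1;
            cur = node.next.load(Ordering::Acquire);
        }
        (seen, ok)
    }
}

/// Runs `readers` scanning threads against one writer doing `writes` pushes.
fn run_demo(writes: u64, readers: usize) -> (usize, bool) {
    let list = Arc::new(PublishList::new());
    let handles: Vec<_> = (0..readers)
        .map(|_| {
            let list = Arc::clone(&list);
            thread::spawn(move || (0..200).fold(true, |ok, _| ok & list.scan().1))
        })
        .collect();
    for k in 0..writes {
        list.push_front(k);
    }
    let readers_ok = handles.into_iter().map(|h| h.join().unwrap()).fold(true, |a, b| a && b);
    let (seen, ok) = list.scan();
    (seen, ok && readers_ok)
}
```

A skip list applies the same release/acquire pairing to each forward pointer, linking the new node bottom-up so that any level a reader can reach it from already sees an initialized node.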

2. Concurrent hash table: lock striping + extendible hashing

Lock striping: 64 stripe mutexes; key's stripe = hash & 63. A write locks its stripe; reads either lock-read or use seqlock counters.

Extendible hashing: instead of full-table resize, split one bucket at a time when it grows past a threshold.

3. ART (Adaptive Radix Tree)

A radix tree variant that uses 4 different internal node layouts (4, 16, 48, 256 children) and adapts based on density. Wins for variable-length keys with shared prefixes (URLs, paths). [Leis et al., ICDE 2013].

4. Cuckoo hashing

Two hash functions, two candidate slots per key. A lookup reads at most 2 slots, hit or miss. Used in Memcached extensions.

5. Hopscotch hashing

Each entry must live within H slots of its home (typically H=32). Bounded probe distance like Robin Hood with stronger guarantee.

6. B+-Tree (preview db-10)

Write an in-memory B+-Tree with fan-out 64, leaves chained, and compare to skip list on range scans. This sets up the "why on-disk B-Trees beat skip lists" intuition before you've touched disk.

7. Skip list with rank queries (Redis ZRANK)

Add a span field per forward pointer = "number of bottom-level nodes this pointer skips over." Now rank(key) is O(log n) instead of O(n). ~50 LOC of additions.

8. Bloom filter (preview db-04)

In front of the hash table: a 1-bit-per-position array sized for ~1% false-positive rate at expected N. Measure: at 50%-miss workloads the Bloom filter saves you cache misses; at 95%-hit workloads it's pure overhead.
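A starting point for this extension, as a minimal sketch: size the bit array with the standard formulas m = −n · ln(ε) / (ln 2)² and k = (m/n) · ln 2, and derive the k bit positions by double hashing. The type and method names are hypothetical, and std's `DefaultHasher` is used only to keep the example dependency-free; the lab's own hash would be the natural replacement.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

pub struct Bloom {
    bits: Vec<u64>,
    m: u64, // number of bits
    k: u32, // number of probes per key
}

fn hash_with_salt(key: &[u8], salt: u64) -> u64 {
    let mut h = DefaultHasher::new();
    salt.hash(&mut h);
    key.hash(&mut h);
    h.finish()
}

impl Bloom {
    /// Sizes the filter for `expected_n` keys at roughly `fp_rate` false positives.
    pub fn with_rate(expected_n: usize, fp_rate: f64) -> Self {
        let ln2 = std::f64::consts::LN_2;
        let m = (-(expected_n as f64) * fp_rate.ln() / (ln2 * ln2)).ceil().max(64.0) as u64;
        let k = ((m as f64 / expected_n as f64) * ln2).round().max(1.0) as u32;
        Bloom { bits: vec![0; ((m + 63) / 64) as usize], m, k }
    }

    /// i-th probe position via double hashing: h1 + i * h2 (mod m).
    fn bit_index(&self, key: &[u8], i: u32) -> u64 {
        let h1 = hash_with_salt(key, 0);
        let h2 = hash_with_salt(key, 1) | 1;
        h1.wrapping_add((i as u64).wrapping_mul(h2)) % self.m
    }

    pub fn insert(&mut self, key: &[u8]) {
        for i in 0..self.k {
            let b = self.bit_index(key, i);
            self.bits[(b / 64) as usize] |= 1u64 << (b % 64);
        }
    }

    /// False means definitely absent; true means "probably present".
    pub fn may_contain(&self, key: &[u8]) -> bool {
        (0..self.k).all(|i| {
            let b = self.bit_index(key, i);
            self.bits[(b / 64) as usize] & (1u64 << (b % 64)) != 0
        })
    }
}
```

For 1000 keys at a 1% target this comes out to roughly 9.6 bits per key and 7 probes. To run the measurement described above, call `may_contain` before the hash-table lookup and skip the lookup when it returns false.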

9. xor / cuckoo / ribbon filters

Modern variants (xor [Graf & Lemire 2020], ribbon [Dillinger 2021]) get the same false-positive rate as Bloom with 25–35% less memory.

10. Cache-conscious skip list

Replace the per-node forward-pointer array with a contiguous tail allocation (RocksDB's InlineSkipList). Compare cache miss rates: same algorithm, half the misses.

11. Persistent / immutable variants

Build an immutable skip list where insert returns a new root, sharing 99% of nodes with the old one. Useful for snapshots, MVCC.


When you've explored a couple, you're ready for db-03 Write-Ahead Log, where the durability story begins.

Step 1 — Implement the Skip List

Goal

Build a sorted map with O(log n) expected insert/lookup/remove and O(n) ordered iteration.

API

SkipList::new(seed: u64) -> SkipList
SkipList::insert(key, value) -> bool   // false if replaced existing
SkipList::get(key) -> Option<value>
SkipList::remove(key) -> bool
SkipList::len() -> usize
SkipList::iter() -> iterator<(key, value)>
SkipList::max_level_used() -> usize
SkipList::level_histogram() -> [usize; MAX_LEVEL]

Constants

  • MAX_LEVEL = 32
  • P = 0.5 (sample via min(count_trailing_zeros(rng()) + 1, MAX_LEVEL))

Data layout

SkipList {
    head:    Node*       // dummy sentinel at MAX_LEVEL
    level:   usize       // current max level used (1..=MAX_LEVEL)
    len:     usize
    rng:     u64         // xorshift64 state
}

Node {
    key:      Vec<u8>
    value:    Vec<u8>
    forward:  Vec<*mut Node>           // length = height of this node; raw links, the level-0 chain is the one that owns and frees nodes
}

In Rust we use Box<Node> for ownership and raw pointers for siblings (or Option<NonNull<Node>> for safer raw pointers). In Go, *Node is the natural choice. In C++, std::unique_ptr<Node> for the sole owner of level-0 next, raw pointers for higher levels.

For simplicity we use a single ownership style: level-0 owns nodes; higher levels hold raw pointers. Drop walks the level-0 chain once.

Insert pseudocode

update[0..MAX_LEVEL] = HEAD
x = head
for i from level-1 down to 0:
    while x.forward[i] != null && x.forward[i].key < key:
        x = x.forward[i]
    update[i] = x

if x.forward[0] != null && x.forward[0].key == key:
    x.forward[0].value = value
    return false                       // replaced

new_level = random_level()
if new_level > level:
    for i in level..new_level: update[i] = HEAD
    level = new_level

node = new Node(key, value, new_level)
for i in 0..new_level:
    node.forward[i] = update[i].forward[i]
    update[i].forward[i] = node
len += 1
return true
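One way to turn the pseudocode into safe Rust without raw pointers is to keep all nodes in a Vec and store indices in the forward arrays. The sketch below makes that substitution (it is not the Box/raw-pointer layout described above), omits remove, and forces the RNG state odd so a zero seed cannot stall xorshift64; those choices and the `keys` helper are assumptions for illustration.

```rust
const MAX_LEVEL: usize = 32;
const NIL: usize = usize::MAX;

struct Node {
    key: Vec<u8>,
    value: Vec<u8>,
    forward: Vec<usize>, // length = height of this node
}

pub struct SkipList {
    nodes: Vec<Node>, // nodes[0] is the head sentinel
    level: usize,     // current max level used (1..=MAX_LEVEL)
    len: usize,
    rng: u64,
}

impl SkipList {
    pub fn new(seed: u64) -> Self {
        let head = Node { key: Vec::new(), value: Vec::new(), forward: vec![NIL; MAX_LEVEL] };
        SkipList { nodes: vec![head], level: 1, len: 0, rng: seed | 1 }
    }

    fn random_level(&mut self) -> usize {
        self.rng ^= self.rng << 13;
        self.rng ^= self.rng >> 7;
        self.rng ^= self.rng << 17;
        (self.rng.trailing_zeros() as usize + 1).min(MAX_LEVEL)
    }

    /// Records in update[i] the last node on level i with key < `key`,
    /// and returns the index of its level-0 successor (or NIL).
    fn find(&self, key: &[u8], update: &mut [usize; MAX_LEVEL]) -> usize {
        let mut x = 0;
        for i in (0..self.level).rev() {
            loop {
                let next = self.nodes[x].forward[i];
                if next == NIL || self.nodes[next].key.as_slice() >= key {
                    break;
                }
                x = next;
            }
            update[i] = x;
        }
        self.nodes[x].forward[0]
    }

    /// Returns true if newly inserted, false if an existing value was replaced.
    pub fn insert(&mut self, key: &[u8], value: &[u8]) -> bool {
        let mut update = [0usize; MAX_LEVEL]; // 0 = head, the default predecessor
        let next = self.find(key, &mut update);
        if next != NIL && self.nodes[next].key.as_slice() == key {
            self.nodes[next].value = value.to_vec();
            return false;
        }
        let height = self.random_level();
        self.level = self.level.max(height);
        let idx = self.nodes.len();
        let mut forward = Vec::with_capacity(height);
        for i in 0..height {
            forward.push(self.nodes[update[i]].forward[i]); // splice in on level i
            self.nodes[update[i]].forward[i] = idx;
        }
        self.nodes.push(Node { key: key.to_vec(), value: value.to_vec(), forward });
        self.len += 1;
        true
    }

    pub fn get(&self, key: &[u8]) -> Option<&[u8]> {
        let mut update = [0usize; MAX_LEVEL];
        let next = self.find(key, &mut update);
        if next != NIL && self.nodes[next].key.as_slice() == key {
            Some(self.nodes[next].value.as_slice())
        } else {
            None
        }
    }

    pub fn len(&self) -> usize {
        self.len
    }

    /// Keys in sorted order, by walking the level-0 chain.
    pub fn keys(&self) -> Vec<Vec<u8>> {
        let mut out = Vec::with_capacity(self.len);
        let mut x = self.nodes[0].forward[0];
        while x != NIL {
            out.push(self.nodes[x].key.clone());
            x = self.nodes[x].forward[0];
        }
        out
    }
}
```

Index-based links cost one extra bounds check per hop but remove every unsafe block, which makes this a reasonable first version to test the invariants against before moving to the pointer layout.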

Random level

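use std::cmp::min;

const MAX_LEVEL: usize = 32;

/// xorshift64 step, then height = trailing zero bits + 1 (geometric, p = 0.5).
/// The state must be seeded non-zero: xorshift64 maps 0 to 0 forever.
fn random_level(rng: &mut u64) -> usize {
    *rng ^= *rng << 13;
    *rng ^= *rng >> 7;
    *rng ^= *rng << 17;
    let lvl = (rng.trailing_zeros() as usize) + 1;
    min(lvl, MAX_LEVEL)
}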

Tests

| # | Test | Pass if |
| --- | --- | --- |
| T1 | insert 1000 random keys, all get succeed | every value matches |
| T2 | insert sorted keys 0..999, iter yields 0..999 | strictly increasing |
| T3 | insert + remove all keys, len = 0 | empty |
| T4 | insert with same key twice, len unchanged | replacement worked |
| T5 | level distribution at N=100k is geometric | sum of L≥k slots ≈ N · 2^-k |

Step 2 — Implement the Hash Table

Goal

Open-addressing hash table with linear probing + Robin Hood + backward-shift deletion.

API

HashTable::new(initial_capacity_pow2: usize) -> HashTable
HashTable::insert(key, value) -> bool          // false if replaced
HashTable::get(key) -> Option<value>
HashTable::remove(key) -> bool
HashTable::len() -> usize
HashTable::capacity() -> usize
HashTable::load_factor() -> f64
HashTable::max_probe() -> usize
HashTable::probe_histogram() -> Vec<usize>

Hash function

FNV-1a 64-bit followed by a SplitMix64 finalizer (identical in all three languages):

offset = 0xcbf29ce484222325
prime  = 0x00000100000001b3        # = 1_099_511_628_211
h = offset
for byte in key:
    h ^= byte
    h = h * prime  (wrapping)
return splitmix64_finalize(h)

fn splitmix64_finalize(mut h: u64) -> u64 {
    h ^= h >> 30; h = h.wrapping_mul(0xbf58476d1ce4e5b9);
    h ^= h >> 27; h = h.wrapping_mul(0x94d049bb133111eb);
    h ^= h >> 31;
    h
}

Plain FNV-1a has notoriously poor avalanche on short, sequential keys — running it raw against Robin Hood probing blows up max_probe (we measured 206 vs the expected ≲66 bound at N=100k). The SplitMix64 finalizer is bijective (adds no collisions) and re-mixes the high bits down, which restores the geometric PSL distribution.
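Put together in Rust, the two stages look like the sketch below. The function names `fnv1a64` and `hash_key` are hypothetical; the constants are the ones listed above. Keeping the raw FNV-1a stage as its own function makes it possible to test it against the published FNV-1a vectors before the finalizer is applied.

```rust
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Raw FNV-1a 64-bit: xor the byte in, then multiply by the prime.
pub fn fnv1a64(key: &[u8]) -> u64 {
    let mut h = FNV_OFFSET;
    for &byte in key {
        h ^= byte as u64;
        h = h.wrapping_mul(FNV_PRIME);
    }
    h
}

/// SplitMix64 finalizer: a bijective mix that spreads high bits downward.
pub fn splitmix64_finalize(mut h: u64) -> u64 {
    h ^= h >> 30;
    h = h.wrapping_mul(0xbf58_476d_1ce4_e5b9);
    h ^= h >> 27;
    h = h.wrapping_mul(0x94d0_49bb_1331_11eb);
    h ^= h >> 31;
    h
}

/// The hash the table indexes with: FNV-1a followed by the finalizer.
pub fn hash_key(key: &[u8]) -> u64 {
    splitmix64_finalize(fnv1a64(key))
}
```

A typo in the prime or offset shows up immediately in the raw stage: `fnv1a64(b"a")` should be 0xaf63dc4c8601ec8c and `fnv1a64(b"foobar")` should be 0x85944171f73967e8.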

Known-answer vectors (pin these in tests across all three languages):

| key | hash |
| --- | --- |
| "" | 0xf52a15e9a9b5e89b |
| "a" | 0x02c0bdbf481420f8 |
| "foobar" | 0x404da9e3b74078c2 |

Slot layout

Slot {
    occupied: bool        // or use psl = MAX as sentinel
    psl:      u32         // probe sequence length
    hash:     u64
    key:      Vec<u8>
    value:    Vec<u8>
}

We store the full 64-bit hash inside each slot so resizes don't re-hash keys, and so that get can compare hashes (cheap) before keys (expensive).

Insert (Robin Hood)

if (len + 1) / capacity > 0.85: resize(capacity * 2)

h = hash(key)
i = h & (capacity - 1)
psl = 0
loop:
    if !slots[i].occupied:
        slots[i] = (key, value, h, psl); occupied; len += 1
        return true
    if slots[i].hash == h && slots[i].key == key:
        slots[i].value = value
        return false                 // replaced
    if slots[i].psl < psl:
        swap(slots[i], (key, value, h, psl))   // steal: the evicted resident is now the entry we keep probing for
    i = (i + 1) & (capacity - 1)
    psl += 1

Get

h = hash(key)
i = h & (capacity - 1)
psl = 0
loop:
    if !slots[i].occupied: return None
    if slots[i].psl < psl: return None      // would have stolen
    if slots[i].hash == h && slots[i].key == key:
        return Some(slots[i].value)
    i = (i + 1) & (capacity - 1)
    psl += 1

Remove (backward-shift)

i = find_slot(key)
if not found: return false
loop:
    j = (i + 1) & (capacity - 1)
    if !slots[j].occupied || slots[j].psl == 0:
        slots[i].occupied = false
        len -= 1
        return true
    slots[i] = slots[j]
    slots[i].psl -= 1
    i = j

Resize

new_slots = [empty; capacity * 2]
old = swap(slots, new_slots); capacity *= 2; len = 0
for slot in old where occupied:
    insert_hashed(slot.key, slot.value, slot.hash)   // reuse the stored hash; psl restarts at 0
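The four routines above can be sketched together in Rust. This is a minimal illustration under assumptions, not the lab's implementation: `RobinHood`, `place`, and `find` are hypothetical names, an empty slot is modelled as `None` instead of an `occupied` flag, and the initial capacity of 8 is arbitrary.

```rust
fn hash64(key: &[u8]) -> u64 {
    let mut h = 0xcbf2_9ce4_8422_2325u64;
    for &b in key {
        h = (h ^ b as u64).wrapping_mul(0x0000_0100_0000_01b3);
    }
    h = (h ^ (h >> 30)).wrapping_mul(0xbf58_476d_1ce4_e5b9);
    h = (h ^ (h >> 27)).wrapping_mul(0x94d0_49bb_1331_11eb);
    h ^ (h >> 31)
}

#[derive(Clone)]
struct Slot {
    psl: u32, // probe sequence length: distance from the home bucket
    hash: u64,
    key: Vec<u8>,
    value: Vec<u8>,
}

/// Hypothetical minimal map: `None` is an empty slot, capacity is a power of two.
pub struct RobinHood {
    slots: Vec<Option<Slot>>,
    pub len: usize,
}

impl RobinHood {
    pub fn new() -> Self { RobinHood { slots: vec![None; 8], len: 0 } }

    /// Returns true for a new key, false when an existing value was replaced.
    pub fn insert(&mut self, key: &[u8], value: &[u8]) -> bool {
        if (self.len + 1) * 100 > self.slots.len() * 85 {
            let new_cap = self.slots.len() * 2;
            let old = std::mem::replace(&mut self.slots, vec![None; new_cap]);
            self.len = 0;
            for mut s in old.into_iter().flatten() {
                s.psl = 0;
                self.place(s); // reuses the cached hash
            }
        }
        self.place(Slot { psl: 0, hash: hash64(key), key: key.to_vec(), value: value.to_vec() })
    }

    fn place(&mut self, mut cur: Slot) -> bool {
        let mask = self.slots.len() - 1;
        let mut i = (cur.hash as usize) & mask;
        loop {
            match self.slots[i].as_mut() {
                None => {
                    self.slots[i] = Some(cur);
                    self.len += 1;
                    return true;
                }
                Some(s) if s.hash == cur.hash && s.key == cur.key => {
                    s.value = cur.value;
                    return false;
                }
                Some(s) if s.psl < cur.psl => std::mem::swap(s, &mut cur), // steal
                Some(_) => {}
            }
            i = (i + 1) & mask;
            cur.psl += 1;
        }
    }

    fn find(&self, key: &[u8]) -> Option<usize> {
        let (h, mask) = (hash64(key), self.slots.len() - 1);
        let (mut i, mut psl) = ((h as usize) & mask, 0u32);
        loop {
            let s = self.slots[i].as_ref()?; // empty slot: the key is absent
            if s.psl < psl { return None; } // our key would have stolen this slot
            if s.hash == h && s.key == key { return Some(i); }
            i = (i + 1) & mask;
            psl += 1;
        }
    }

    pub fn get(&self, key: &[u8]) -> Option<&[u8]> {
        let i = self.find(key)?;
        self.slots[i].as_ref().map(|s| s.value.as_slice())
    }

    /// Backward-shift delete: pull successors one slot closer to home.
    pub fn remove(&mut self, key: &[u8]) -> bool {
        let Some(mut i) = self.find(key) else { return false };
        let mask = self.slots.len() - 1;
        loop {
            let j = (i + 1) & mask;
            if self.slots[j].as_ref().map_or(true, |s| s.psl == 0) {
                self.slots[i] = None;
                self.len -= 1;
                return true;
            }
            let mut s = self.slots[j].take().unwrap();
            s.psl -= 1;
            self.slots[i] = Some(s);
            i = j;
        }
    }
}
```

The probe loops terminate because the 85% load cap always leaves at least one empty slot. Resizing goes through `place` with the stored hash, so key bytes are hashed once per insert and never again.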

Tests

| # | Test | Pass if |
|---|---|---|
| T1 | insert + get of 10k random keys | all hits |
| T2 | insert 10k, remove 5k random, get remaining 5k | all still found, removed not found |
| T3 | insert past 85% load triggers resize | capacity doubled |
| T4 | duplicate insert replaces value, len unchanged | replacement worked |
| T5 | max PSL ≤ 4·log₂(cap·load) at 1M inserts | bounded variance |

Step 3 — Benchmark and Compare

Goal

Quantify the cost difference between the two structures on three workloads.

Workloads

| Name | Op | Measured |
|---|---|---|
| point | get(key) where key was previously inserted | ns/op + cache misses (if perf) |
| point-miss | get(key) where key is absent | ns/op |
| range | iterate from lo until hi | ns/key (skip list only) |

Procedure

  1. Seed both structures with N (10k, 100k, 1M, 10M) random 8-byte keys + 8-byte values.
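  2. For each workload, take iters random accesses and time each one (or each small fixed-size batch); the mean is the total time divided by iters.
  3. Report p50, p99, mean, total throughput. The percentiles need the per-access samples from step 2; a single timed loop only yields the mean.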
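A hedged sketch of the measurement loop in Rust. `percentile`, `Summary`, and `bench` are hypothetical names, not part of dsbench, and the nearest-rank percentile definition is an assumption. Timing every call with `Instant` adds overhead comparable to a fast hash-table lookup, so a real harness would likely time small batches instead.

```rust
use std::time::Instant;

/// Nearest-rank percentile over an already sorted, non-empty slice (p in 0..=100).
pub fn percentile(sorted: &[u64], p: f64) -> u64 {
    assert!(!sorted.is_empty());
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

pub struct Summary {
    pub p50: u64,         // nanoseconds
    pub p99: u64,         // nanoseconds
    pub mean: f64,        // nanoseconds
    pub ops_per_sec: f64, // total throughput including timer overhead
}

/// Calls `op(i)` for i in 0..iters, recording one latency sample per call.
pub fn bench<F: FnMut(usize)>(iters: usize, mut op: F) -> Summary {
    let mut samples = Vec::with_capacity(iters);
    let start = Instant::now();
    for i in 0..iters {
        let t = Instant::now();
        op(i);
        samples.push(t.elapsed().as_nanos() as u64);
    }
    let total = start.elapsed().as_secs_f64();
    samples.sort_unstable();
    let mean = samples.iter().sum::<u64>() as f64 / iters as f64;
    Summary {
        p50: percentile(&samples, 50.0),
        p99: percentile(&samples, 99.0),
        mean,
        ops_per_sec: iters as f64 / total,
    }
}
```

A caller would pass a closure such as `|i| { map.get(&keys[i]); }` and wrap the result in `std::hint::black_box` so the optimizer cannot drop the lookup.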

Expected outcomes (M2 Pro, release, N=1M)

| Op | Skip list | Hash table | Ratio |
|---|---|---|---|
| Insert | ~700 ns | ~95 ns | ~7× |
| Point hit | ~450 ns | ~25 ns | 18× |
| Point miss | ~500 ns | ~30 ns | 17× |
| Range scan 1000 from key | ~25 µs | N/A | |
| Bytes/entry | ~80 | ~25 | 3.2× |

What the numbers prove

  • For unordered point access, never use a skip list. The ~17–18× gap on point lookups tracks the cache-miss count: the hash table touches ~1 line, the skip list touches ~20 (about one per level, and log₂ 1M ≈ 20).
  • For ordered access, the skip list is the only option of the two. A range scan on a hash table requires collecting all entries and sorting them, an O(n log n) setup, vs O(log n + k) for the skip list (one seek, then k forward steps).
  • The memory gap is real and gets worse for tiny values. Skip-list forward-pointer arrays dominate when value size is < ~64 B.

How to run

./target/release/dsbench bench point 1000000
./target/release/dsbench bench mem   1000000

Optional: cache-miss measurement

# Linux
perf stat -e cache-misses,cache-references ./target/release/dsbench bench point 1000000

# macOS (Instruments → Counters template; --launch starts the benchmark under the profiler)
xcrun xctrace record --template 'Counters' --launch -- ./target/release/dsbench bench point 1000000

Write-Ahead Log (WAL)

Status: complete — runnable in Rust, Go, C++.

1. What Is It

A WAL is an append-only file that records intent-to-modify before the actual data pages are updated. On crash, the recovery routine replays the WAL from the last checkpoint, restoring the database to a consistent state.

client write  ──►  append record  ──►  fsync(WAL)  ──►  ack client
                                                          │
                                                          ▼
                                                  later: apply to data file
                                                  later: checkpoint & truncate WAL

The critical invariant is the write order: WAL hits stable storage before the client is told the write committed. If the process dies between the fsync and the data-page update, recovery re-applies the logged operation. If it dies before the fsync, the client never got an ack, so losing the record is acceptable.

2. Why It Matters

| Without WAL | With WAL |
|---|---|
| Random in-place writes to the data file | Sequential append (10–100× faster on HDD, still better on SSD) |
| Each commit = random-page fsync | Each commit = single sequential append + fsync |
| Crash mid-update ⇒ torn page, corrupt file | Crash ⇒ replay log, idempotent recovery |
| Group commit impossible | Multiple commits batched into one fsync ("group commit") |

Every serious database has one: PostgreSQL's WAL, MySQL's redo log, SQLite's WAL mode, RocksDB's WAL, LevelDB's NNNNNN.log files (not the file named LOG, which is only a human-readable info log), Kafka's segments.

3. How It Works

Record framing used in this lab (a simplified take on LevelDB's log record format: no block grouping, no record-type byte, and a 32-bit length):

 ┌─────────┬─────────┬──────────────────────┐
 │ len(u32)│ crc(u32)│ payload (len bytes)  │
 └─────────┴─────────┴──────────────────────┘
       4         4              N
  • len and crc are little-endian.
  • crc is CRC-32 (IEEE 802.3 polynomial, reflected) of the payload only.
  • Records are written back-to-back with no padding.
  • Recovery iterates from offset 0 and stops at the first record whose header is short, whose payload is short, or whose CRC fails. The valid prefix is replayed; the bad tail is silently truncated on next open.
file:
   ┌───┬───┬─────┬───┬───┬─────┬───┬───┬──┐  ← crashed mid-write
   │L=8│CRC│ A…  │L=4│CRC│ B…  │L=9│CRC│??│
   └───┴───┴─────┴───┴───┴─────┴───┴───┴──┘
                                ▲          ▲
                                │          └─ short payload → stop, truncate from here
                                └─ last fully-flushed record
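The framing and the stop conditions can be sketched over an in-memory buffer. This is an illustrative sketch: `encode_record` and `scan` are hypothetical helper names, and the CRC is computed bit by bit for brevity instead of with a lookup table.

```rust
/// Bitwise CRC-32 (IEEE, reflected): init 0xFFFFFFFF, final XOR 0xFFFFFFFF.
fn crc32_ieee(bytes: &[u8]) -> u32 {
    let mut c: u32 = 0xFFFF_FFFF;
    for &b in bytes {
        c ^= b as u32;
        for _ in 0..8 {
            c = if c & 1 != 0 { (c >> 1) ^ 0xEDB8_8320 } else { c >> 1 };
        }
    }
    c ^ 0xFFFF_FFFF
}

fn le_u32(b: &[u8]) -> u32 {
    u32::from_le_bytes([b[0], b[1], b[2], b[3]])
}

/// Frames one payload as len(u32 LE) || crc(u32 LE) || payload.
pub fn encode_record(payload: &[u8]) -> Vec<u8> {
    // len == 0 is reserved as the EOF sentinel, so empty payloads are rejected.
    assert!(!payload.is_empty() && payload.len() <= u32::MAX as usize);
    let mut rec = Vec::with_capacity(8 + payload.len());
    rec.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    rec.extend_from_slice(&crc32_ieee(payload).to_le_bytes());
    rec.extend_from_slice(payload);
    rec
}

/// Returns the payloads of the valid prefix and the offset where that prefix ends.
pub fn scan(buf: &[u8]) -> (Vec<&[u8]>, usize) {
    let mut out = Vec::new();
    let mut pos = 0usize;
    while buf.len() - pos >= 8 {
        // A full header is available at `pos`.
        let len = le_u32(&buf[pos..pos + 4]) as usize;
        let crc = le_u32(&buf[pos + 4..pos + 8]);
        if len == 0 || len > buf.len() - pos - 8 {
            break; // EOF sentinel, or short payload
        }
        let payload = &buf[pos + 8..pos + 8 + len];
        if crc32_ieee(payload) != crc {
            break; // corrupt record
        }
        out.push(payload);
        pos += 8 + len;
    }
    (out, pos)
}
```

The second return value of `scan` is the offset a recovery routine would truncate the file to.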

4. Core Terminology

| Term | Definition |
|---|---|
| Record | Self-describing unit: header (len + crc) followed by payload bytes. |
| fsync | Syscall asking the kernel to flush the file's dirty pages and inode metadata to stable storage. |
| fdatasync | Like fsync, but skips metadata that is not needed to read the data back (such as mtime). A size change caused by an append is still flushed. Slightly faster. |
| Group commit | Coalescing multiple in-flight appends into one shared fsync. |
| Torn write | A multi-sector write of which only some sectors reached the device before the crash. CRC catches this. |
| Tail truncation | Scanning forward at open and discarding any partial trailing record. |
| Checkpoint | Flush dirty pages to the data file, then record the WAL position before which replay is unnecessary. |

5. Mental Models

  1. The log is the source of truth, the data file is the cache. Recovery reconstructs from the log.
  2. CRC is for detection, not correction. It tells you where the good prefix ends; it does not heal damage.
  3. fsync is a barrier, not a universal durability guarantee. Consumer SSDs and FUSE filesystems sometimes lie. Use fio --fdatasync=1 to spot-check hardware.
  4. Sequential I/O wins. Even on SSDs, sequential writes have better SLC-cache and GC behavior than random ones.

6. Common Misconceptions

  • "write() already put it on disk." No — kernel page cache. fsync is required for durability.
  • "CRC + length is enough." Necessary, not sufficient. A record with both len and crc zeroed is indistinguishable from len=0,crc=crc32([]). We disallow len=0 (treat as EOF).
  • "Group commit hurts latency." Tiny median bump for a 10–100× throughput win and lower tail latency under load.
  • "fsync == O_DIRECT." Different layers. O_DIRECT bypasses the page cache; fsync flushes it.

7. Interview Talking Points

  • Distinguish redo log (PostgreSQL/InnoDB), WAL mode (SQLite WAL), and rollback-journal mode (SQLite default).
  • Why CRC the payload, not the header? (The reader needs len before it can locate the payload, and a corrupt len is already caught by a short read or a payload CRC mismatch.)
  • fsync vs fdatasync vs sync_file_range.
  • Group commit mechanics and tradeoff.
  • Why WAL alone doesn't beat torn writes on the data file → full-page writes in WAL after each checkpoint.

8. Connections to Other Labs

  • db-01 — every append here is a pwrite + fsync from db-01.
  • db-05, db-09 — LSM writes always hit the WAL first.
  • db-11 — SQLite WAL mode reuses this exact pattern around a B-tree pager.
  • db-13 — commit records & 2PC live in the WAL.
  • db-17 — Raft's replicated log is, mechanically, a WAL.

References — Write-Ahead Log

Papers

  • ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging. C. Mohan et al., 1992. The canonical WAL paper; introduces the redo–undo discipline still used everywhere.
  • The Log-Structured Merge-Tree (LSM-Tree). P. O'Neil et al., 1996. Section 2 motivates why a separate sequential log is necessary even for in-memory writes.

Code

  • LevelDB db/log_format.h — record types & block structure that inspired this lab's framing.
  • RocksDB db/log_writer.cc — production-grade log record writer; the group-commit (write batching) logic lives in db/write_thread.cc.
  • PostgreSQL src/backend/access/transam/xlog.c — full-page writes, redo machinery.

CRC

  • A Painless Guide to CRC Error Detection Algorithms. Ross Williams, 1993. Plain-English walk-through of polynomial division and reflected algorithms.
  • Linux kernel lib/crc32.c — reference table-driven implementation.

Filesystems & fsync

  • Can Applications Recover from fsync Failures? Anthony Rebello et al., USENIX ATC 2020. Surveys the depressing landscape of partial fsync failures.
  • Files are hard. Dan Luu, 2017. Survey blog post on every way fsync, rename, and friends betray you.
  • man 2 fsync, man 2 fdatasync, man 2 open (O_DSYNC, O_SYNC).

Analysis — Write-Ahead Log

Problem statement

Make a stream of writes durable and crash-recoverable without paying for a random in-place fsync per write.

Constraints

| Constraint | Why it matters |
|---|---|
| Append-only file | Sequential I/O, ~10–100× faster than random on HDD; better SLC/GC behavior on SSD. |
| Self-describing records | Recovery must work without a side index. Header = length + checksum. |
| Truncation-tolerant tail | Crash mid-write leaves a partial record. We must detect & ignore it on next open. |
| Single writer | We do not address multi-writer log multiplexing here (Kafka does). |
| No structural guarantees from the FS | Don't assume ordering of metadata vs data, or that 4 KB writes are atomic. |

Design decisions

  1. Header = len(u32 LE) || crc32(u32 LE). Small (8 B), fixed-size, endian-fixed. We deliberately keep the length out of the CRC so we can stream-checksum the body on its own.
  2. CRC is over the payload only. The header itself is implicitly validated by use — a corrupt len either points beyond EOF (short read) or to data whose CRC won't match.
  3. len == 0 is disallowed, used as the EOF sentinel. Empty payloads are rare in practice and avoiding the ambiguity simplifies the reader (len=0,crc=0 happens naturally in a hole of zeros from a sparse file or pre-allocated extent).
  4. Little-endian on disk. Everyone runs LE now (x86, ARM, RISC-V; even POWER prefers it). On an LE host the on-disk byte order matches memory, so skipping the htole32 dance saves ~5 LOC per language.
  5. CRC table generated at startup, not hardcoded. 1 KB, computed in microseconds. Easier to audit, and lets us swap polynomials in tests.
  6. One file, one writer, one fd. No segment rotation in this lab — that lives in db-07 (compaction) and db-09 (LevelDB). Single-file WAL is enough to teach framing.
  7. sync() is a separate method. The caller decides commit boundaries. Production systems may add append_sync(payload) that batches a group commit; we leave that for bench mode.

Why this design over alternatives

  • vs LevelDB's block-grouped framing: LevelDB pads records to 32KB blocks for alignment and easier corruption isolation. Beautiful, but doubles the code volume. We follow this lab's bias of "minimum to teach the concept, plus one cross-language cross-check".
  • vs JSON / protobuf framing: would require schema management. CRC + raw bytes is the smallest possible recoverable framing.
  • vs a per-record fsync: we expose a separate sync() so the user can choose between durability per-record (call after every append) and group commit (call periodically).

Failure modes addressed

| Failure | Detection |
|---|---|
| Partial header at EOF | Header read < 8 bytes ⇒ stop iteration. |
| Header OK, partial payload | Payload read < len ⇒ stop iteration. |
| Full record, CRC mismatch (bit-flip) | CRC32 over payload ≠ stored CRC ⇒ stop iteration. |
| Hole of zeros (sparse FS, preallocated) | len == 0 is the EOF sentinel ⇒ stop iteration cleanly. |
| Disk fully lying about fsync | Out of scope; fio --fdatasync=1 can help spot-check the device. |

Failure modes NOT addressed in this lab

  • Bit-flip in the header itself that produces a plausible (len, crc) pair (probability ≈ 2⁻³²). Production systems mitigate with a record-type byte (LevelDB) or magic bytes (Kafka).
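  • Multi-process writers. O_APPEND makes the seek-to-end and the write a single atomic step on local POSIX filesystems, which is the usual starting point; note that the PIPE_BUF atomicity guarantee applies to pipes and FIFOs, not to regular files. See db-09 / db-21.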
  • Disk full mid-write. Treat as torn write at EOF (the trailing record fails CRC and is dropped on recovery); the caller's append() returns an I/O error that they must handle.

Execution — How to Build and Run

Quick start (per language)

# Rust
cd src/rust
cargo build --release
cargo test --release
./target/release/walbench --help

# Go
cd src/go
go test ./...
go build -o bin/walbench ./cmd/walbench
./bin/walbench --help

# C++
cd src/cpp
cmake -S . -B build && cmake --build build
ctest --test-dir build
./build/walbench --help

CLI: walbench

A single binary per language exercising the WAL.

| Subcommand | Args | What it does |
|---|---|---|
| append PATH N [SIZE] | path, count, payload bytes (default 64) | Appends N records of SIZE bytes; reports bytes/sec |
| append-sync PATH N [SIZE] | same | Same as append but sync() after each record |
| read PATH | path | Replays the log, prints len(crc=…) ok per record, then OK n=… bytes=… |
| corrupt PATH OFFSET BYTE | path, offset, byte value | Overwrites one byte in place (for testing tail tolerance) |
| bench-group PATH N BATCH | path, total records, batch size | Group-commit benchmark: sync once per BATCH records |

Library API

Same shape in all three languages.

Wal::open(path)            -> Wal           // creates or opens for append, scans to EOF
Wal::append(payload)       -> u64 offset    // record start offset in file
Wal::sync()                -> ()            // fdatasync (or fsync) the file
Wal::len()                 -> u64           // bytes on disk (post-append, post-sync)
Wal::close()               -> ()            // implicit on Drop in Rust / RAII in C++

Wal::iter(path)            -> iterator<Vec<u8>>   // streams records; stops at first bad/short

open scans forward at startup to (a) find true EOF after a partial-write recovery and (b) optionally truncate the file to that position so the next append doesn't append after a known-bad tail. We do truncate in this implementation — the alternative (leave the bad tail in place) makes file size useless and complicates len().

Verifying

./scripts/verify.sh        # invariants per implementation
./scripts/cross_test.sh    # write in lang A, read in lang B, all nine writer×reader pairs

Observation — Looking at the Bytes

1. Hexdump a freshly written WAL

./build/walbench append /tmp/wal 3 4
xxd /tmp/wal
00000000: 0400 0000 b3ca 9eb5 4141 4141 0400 0000  ........AAAA....
00000010: b3ca 9eb5 4242 4242 0400 0000 b3ca 9eb5  ....BBBB........
00000020: 4343 4343                                CCCC

What you should see:

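  • Three 12-byte records (8-byte header + 4-byte payload each), 36 bytes in total.
  • Identical len fields, because every payload ("AAAA" / "BBBB" / "CCCC") has the same length. The CRC fields must differ from record to record, since the payloads differ; the sample above repeats b3ca 9eb5 for all three, which can only be a placeholder and not real output.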
  • 04 00 00 00 is len = 4 in little-endian.
  • The next 4 bytes are the payload's CRC, also little-endian.

If your file is suspiciously large (e.g., starts with garbage 0x00 or 0xFF runs), open() is opening the file with the wrong flags or your buffer is uninitialized.

2. Group commit vs per-record sync

./build/walbench append-sync /tmp/wal 10000 64       # fsync per record
./build/walbench bench-group  /tmp/wal 10000 1       # group=1, same thing
./build/walbench bench-group  /tmp/wal 10000 64      # 64 records per fsync
./build/walbench bench-group  /tmp/wal 10000 512     # 512 per fsync

Sample (M2 Pro, APFS):

mode             throughput
per-record sync     1,800 records/s   (~556 µs/sync)
group=64          110,000 records/s
group=512         260,000 records/s

Two takeaways: per-record sync is roughly two orders of magnitude slower (about 60× at group=64, about 145× at group=512); group size has diminishing returns past ~256 because the bottleneck shifts to write() itself.

3. Tail truncation in action

./build/walbench append /tmp/wal 5 16
wc -c /tmp/wal                       # 120 bytes (5 × 24)
printf '\xff\xff\xff\xff' >> /tmp/wal
wc -c /tmp/wal                       # 124 bytes
./build/walbench read /tmp/wal       # reads 5, then "stop: short header" (124-120 = 4 < 8)
./build/walbench append /tmp/wal 1 16
wc -c /tmp/wal                       # 144 bytes — open() truncated the garbage, then appended

The reopen-truncate behavior is the most easily-missed correctness detail. If it's broken, your next append lands after the garbage; the reader stops at the garbage, so every record appended after the crash is silently unreachable.

4. CRC sensitivity

Overwriting one byte inside a 64-byte payload should break the CRC of that record but leave everything before it valid:

./build/walbench append /tmp/wal 10 64
./build/walbench corrupt /tmp/wal 100 0x00     # records are 72 bytes each, so offset 100 is mid-payload of the second record
./build/walbench read /tmp/wal | tail
# expected: prints 1 OK record, then "stop: bad crc" (assuming the original byte was not already 0x00)

What "working" looks like

  • Hexdump shows tightly packed 8-byte-header + payload pairs, no padding.
  • Group commit is at least 50× faster than per-record sync.
  • Tail truncation works on first reopen, regardless of how much garbage you appended.

What "broken" looks like

  • A reader that hangs or panics on garbage — fix the bounds checks in the iter loop.
  • File size grows but throughput is flat — you're probably calling fsync inside append accidentally.
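  • CRC doesn't trip on single-bit flips: the reader is probably not comparing the stored CRC at all, since every CRC-32 variant detects any single-bit error. A wrong polynomial or an un-reflected implementation shows up instead as known-answer or cross-language failures (see scripts/verify.sh).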
  • Cross-language test fails — endianness or CRC table bug. Print the first 16 bytes of the file from each language and compare.

Verification — What to Test and How

Property tests (per language)

| # | Test | Pass if |
|---|---|---|
| V1 | crc32_known_vectors | "" → 0x00000000; "a" → 0xE8B7BE43; "123456789" → 0xCBF43926 |
| V2 | roundtrip_small | append "hello" "world", iter yields exactly those two payloads |
| V3 | roundtrip_1000_variable | append 1000 records of pseudo-random sizes 1..1024, iter yields identical sequence |
| V4 | truncated_tail | open, append A and B, fsync, write 5 bytes of garbage past EOF, reopen ⇒ iter yields {A,B} only |
| V5 | corrupt_payload | flip one bit in the payload of record 2 of 5, iter yields {1} (stops at first bad) |
| V6 | corrupt_header | overwrite len of record 2 with 0xFFFFFFFF, iter yields {1} |
| V7 | reopen_truncates_garbage | scenario V4 followed by a new append, total iter yields {A,B,C} and file size equals exactly the three records' total bytes |
| V8 | append_returns_offset | offset returned by appendₙ equals sum of (header+payload) for records 0..n-1 |

Cross-language test

scripts/cross_test.sh performs a 3×3 matrix: for each writer ∈ {go, rust, cpp} and reader ∈ {go, rust, cpp}, write 500 records of varying sizes with a fixed seed in the writer language, read them in the reader language, assert the payload list matches exactly.

This catches:

  • Endianness mistakes in len/crc.
  • Different CRC polynomials or initial value / final XOR.
  • Off-by-one in header parse.
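  • Writes still sitting in a userspace buffer when the reader runs (we close the writer between phases, so anything the writer never flushed is missing from the file). A missing fsync alone is not caught here, because the reader is served from the page cache.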

Manual smoke

./build/walbench append /tmp/wal 100 64
./build/walbench read /tmp/wal | tail
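# expected: OK n=100 bytes=7200   (100 × (8 + 64) if bytes counts headers; 6400 if it counts payload only)
./build/walbench corrupt /tmp/wal 50 0xFF
./build/walbench read /tmp/wal | tail
# expected: stops at the very first record (offset 50 lies inside its payload) and reports a bad record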

What "passing" means

  • All 8 property tests green in all three languages.
  • cross_test.sh exits 0 (9 successful writer×reader runs).
  • Manual smoke: corruption stops the reader cleanly, no panic / no segfault, no infinite loop.

Broader Ideas — Beyond the Minimum

Things worth knowing that aren't in the lab code.

Block-grouped framing (LevelDB / RocksDB)

LevelDB pads records into 32 KB blocks and uses a 1-byte type field (FULL, FIRST, MIDDLE, LAST) to handle records that straddle blocks. Benefit: corruption in one block can't propagate; recovery can resync to the next block boundary. Cost: more code, slightly more space.

Group commit, properly

Real systems run a "log writer" goroutine/thread:

clients ──► append to buffer ──► wake writer ──► fsync once ──► broadcast cond var

The writer batches all records that arrived during the previous fsync into the next fsync. Latency stays bounded by (max fsync time) + (one batch fill); throughput scales until you saturate the SSD's IOPS.
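A minimal sketch of that writer loop in Rust, using a mutex and two condition variables. `GroupCommit`, `commit`, `run_writer`, and the `flush` closure are hypothetical names; `flush` stands in for "write the batch, then fsync once" so the coordination logic can be read without any file I/O.

```rust
use std::sync::{Condvar, Mutex};

#[derive(Default)]
struct State {
    pending: Vec<Vec<u8>>, // records buffered since the last flush started
    appended: u64,         // sequence number of the newest buffered record
    durable: u64,          // highest sequence number covered by a finished flush
    shutdown: bool,
}

#[derive(Default)]
pub struct GroupCommit {
    state: Mutex<State>,
    work: Condvar, // wakes the writer: records are pending
    done: Condvar, // wakes clients: `durable` advanced
}

impl GroupCommit {
    /// Buffers one record and blocks until a flush that covers it has finished.
    pub fn commit(&self, record: Vec<u8>) {
        let mut st = self.state.lock().unwrap();
        st.pending.push(record);
        st.appended += 1;
        let my_seq = st.appended;
        self.work.notify_one();
        while st.durable < my_seq {
            st = self.done.wait(st).unwrap();
        }
    }

    /// Asks the writer to exit once the buffer is drained.
    pub fn shutdown(&self) {
        self.state.lock().unwrap().shutdown = true;
        self.work.notify_one();
    }

    /// Writer loop: one `flush` call per batch, however many records arrived.
    pub fn run_writer<F: FnMut(&[Vec<u8>])>(&self, mut flush: F) {
        loop {
            let mut st = self.state.lock().unwrap();
            while st.pending.is_empty() && !st.shutdown {
                st = self.work.wait(st).unwrap();
            }
            if st.pending.is_empty() {
                return; // shutdown requested and nothing left to flush
            }
            let batch = std::mem::take(&mut st.pending);
            let upto = st.appended;
            drop(st); // clients keep appending while the flush is in flight
            flush(&batch);
            self.state.lock().unwrap().durable = upto;
            self.done.notify_all();
        }
    }
}
```

Records that arrive while `flush` is running accumulate in `pending` and form the next batch, which is the batching behavior described above. A production version would also need to propagate flush errors to the waiting clients.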

O_DSYNC vs application-level fsync

O_DSYNC makes every write() durable before returning. Removes the need for explicit fsync, but you lose the chance to batch. Real DBs prefer explicit fsync for that reason.

sync_file_range and friends

Linux-only. sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE) starts writeback for a byte range without waiting for it to finish. PostgreSQL uses it to spread out checkpoint writeback and avoid stalling on one huge fsync. It does not sync metadata and gives no durability guarantee by itself, so a final fsync is still needed.

Pre-allocation & fallocate

For predictable I/O, pre-allocate the next WAL segment with fallocate(FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE). This avoids metadata updates on each grow and gives the FS a contiguous extent. PostgreSQL pre-zeroes 16 MB segments.

Direct I/O & alignment

O_DIRECT bypasses the page cache; useful when the DB has its own buffer pool. Requires 512 B or 4 KB aligned buffers and offsets. Modern recommendation: prefer io_uring + O_DIRECT over POSIX AIO. Returns in db-21.

Mixing data files and WAL on the same disk

Bad idea for HDDs (head contention), neutral for SSDs (no head), bad for low-end SSDs (write amplification competes). Production systems put WAL on a separate device when latency-sensitive.

When the WAL is the database

LSM-trees, Kafka, NATS JetStream, Pulsar, Apache BookKeeper — these treat the log as the primary structure and let secondary indexes / merge trees / consumers catch up. The data file in our toy example was hypothetical; LSMs make it explicit. See db-05 onward.

Encryption / compression

  • Compression per record: trivial, but blocks the Vec<u8> reuse pattern. Better to compress whole segments at checkpoint time.
  • Encryption per record: AEAD (AES-GCM or ChaCha20-Poly1305) replaces CRC32 — the auth tag is your CRC. PostgreSQL's TDE proposals use this.

Replication

Once you have a sequential log of operations, replicating it is "just" send-and-replay. This is the entire conceptual basis of Raft and ZAB — see db-17 / db-19. The framing tricks here transfer directly.

What goes wrong at scale

  • fsync amplification: every fsync touches the FS journal, which serializes against other fsyncs. Solution: large group commit batches.
  • Long fsync tails: 99th-percentile fsync on a busy NVMe can be 100ms+. Solution: pipeline; never block a hot-path thread on fsync.
  • Filesystems that lie: ext4 with data=writeback may complete fsync before journaling. APFS, ZFS, btrfs each have their own quirks. Empirical test with fio is the only safe answer.

Step 1 — Record framing & CRC

Goal

Define the on-disk format and a streaming CRC32 implementation that matches between Rust, Go, and C++.

Format recap

 ┌─────────┬─────────┬──────────────────────┐
 │ len(u32)│ crc(u32)│ payload (len bytes)  │
 └─────────┴─────────┴──────────────────────┘
       4         4              N
  • Both u32 fields are little-endian.
  • CRC is over the payload only.
  • len == 0 is the EOF sentinel (an empty payload cannot be appended).

CRC32 — table-driven, reflected

poly = 0xEDB88320  // reflected IEEE 802.3 polynomial
table[256]: built once at startup
crc = 0xFFFFFFFF                         // initial value
for each input byte b:
    crc = (crc >> 8) ^ table[(crc & 0xff) ^ b]
return crc ^ 0xFFFFFFFF                  // final XOR

Known-answer vectors

| input | CRC32 hex |
|---|---|
| "" | 0x00000000 |
| "a" | 0xE8B7BE43 |
| "123456789" | 0xCBF43926 |

Pin these in every language's unit tests. They are the canonical CRC-32 IEEE vectors used by zlib, gzip, and Ethernet. (LevelDB's log uses CRC-32C, the Castagnoli polynomial, so its checksums will not match these.)
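A sketch of the table construction and a streaming update in Rust, checked against the vectors above. `make_crc32_table`, `crc32_update`, and `crc32_ieee` are hypothetical names, and passing the table by reference is a simplification of "built once at startup".

```rust
/// Builds the 256-entry lookup table for the reflected polynomial 0xEDB88320.
pub fn make_crc32_table() -> [u32; 256] {
    let mut table = [0u32; 256];
    for i in 0..256u32 {
        let mut c = i;
        for _ in 0..8 {
            c = if c & 1 != 0 { (c >> 1) ^ 0xEDB8_8320 } else { c >> 1 };
        }
        table[i as usize] = c;
    }
    table
}

/// Streaming step: feeds `bytes` into the raw register `state`.
/// Start from 0xFFFFFFFF and XOR with 0xFFFFFFFF after the last chunk.
pub fn crc32_update(table: &[u32; 256], mut state: u32, bytes: &[u8]) -> u32 {
    for &b in bytes {
        state = (state >> 8) ^ table[((state & 0xff) ^ b as u32) as usize];
    }
    state
}

/// One-shot CRC-32 (IEEE) of a whole buffer.
pub fn crc32_ieee(table: &[u32; 256], bytes: &[u8]) -> u32 {
    crc32_update(table, 0xFFFF_FFFF, bytes) ^ 0xFFFF_FFFF
}
```

Feeding "1234" and then "56789" through `crc32_update` and applying the final XOR should give the same value as the one-shot call on "123456789", which is what makes the checksum streamable.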

Rust outline

// TABLE: [u32; 256], built once from the reflected polynomial 0xEDB88320.
pub fn crc32_ieee(bytes: &[u8]) -> u32 {
    let mut c: u32 = 0xFFFF_FFFF;
    for &b in bytes {
        c = (c >> 8) ^ TABLE[((c & 0xff) ^ b as u32) as usize];
    }
    c ^ 0xFFFF_FFFF
}

Go outline

func Crc32IEEE(b []byte) uint32 {
    c := uint32(0xFFFFFFFF)
    for _, x := range b {
        c = (c >> 8) ^ table[byte(c)^x]
    }
    return c ^ 0xFFFFFFFF
}

C++ outline

inline std::uint32_t Crc32Ieee(std::span<const std::uint8_t> b) noexcept {
    std::uint32_t c = 0xFFFFFFFFu;
    for (auto x : b) c = (c >> 8) ^ kTable[(c & 0xff) ^ x];
    return c ^ 0xFFFFFFFFu;
}

Trap: which CRC?

There are at least eight in common use. IEEE (reflected, init 0xFFFFFFFF, final XOR 0xFFFFFFFF) is what we want. 0x04C11DB7 un-reflected is not the same value despite being the same polynomial.

If your test gives 0xC1D04330 for "a" (or 0xE3069283 for "123456789"), you're using CRC-32C (Castagnoli). Different polynomial, different table.
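To make the trap concrete, here is a small Rust sketch that computes three variants bit by bit and compares their results on the standard check string "123456789". The function names are hypothetical. The un-reflected variant shown uses the same init and final XOR as IEEE and corresponds to what CRC catalogues list as CRC-32/BZIP2.

```rust
/// Reflected (LSB-first) CRC-32 with init and final XOR of 0xFFFFFFFF.
/// poly = 0xEDB88320 gives IEEE, poly = 0x82F63B78 gives CRC-32C (Castagnoli).
pub fn crc32_reflected(poly: u32, bytes: &[u8]) -> u32 {
    let mut c: u32 = 0xFFFF_FFFF;
    for &b in bytes {
        c ^= b as u32;
        for _ in 0..8 {
            c = if c & 1 != 0 { (c >> 1) ^ poly } else { c >> 1 };
        }
    }
    c ^ 0xFFFF_FFFF
}

/// Un-reflected (MSB-first) CRC-32 with the same init and final XOR.
/// poly = 0x04C11DB7 is the IEEE polynomial in its normal bit order.
pub fn crc32_msb_first(poly: u32, bytes: &[u8]) -> u32 {
    let mut c: u32 = 0xFFFF_FFFF;
    for &b in bytes {
        c ^= (b as u32) << 24;
        for _ in 0..8 {
            c = if c & 0x8000_0000 != 0 { (c << 1) ^ poly } else { c << 1 };
        }
    }
    c ^ 0xFFFF_FFFF
}
```

All three accept the same input and produce a plausible-looking 32-bit value, which is why pinning the known-answer vectors is the only reliable way to tell them apart.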

Step 2 — Append, sync, iterate

Goal

Implement Wal::open / append / sync / iter consistently in all three languages.

API recap

open(path)        -> Wal      // O_RDWR | O_CREAT, scan-and-truncate the tail
append(payload)  -> u64       // returns the record's start offset
sync()           -> ()        // fdatasync (or fsync on platforms without it)
len()            -> u64       // bytes in the live valid region
iter(path)       -> Iterator  // yields each payload until first short/bad record

open — scan & truncate

The crucial subroutine. After a crash, the file may end in a partial header or partial payload. open finds the last valid record's end and truncates the file to that length, so subsequent appends append cleanly.

pos = 0
loop:
    if file_size - pos < 8: break              // not enough for header
    read 8 bytes at pos: (len, crc)
    if len == 0: break                          // EOF sentinel / sparse hole
    if pos + 8 + len > file_size: break         // payload short
    read len bytes at pos+8
    if crc32(payload) != crc: break
    pos += 8 + len
if pos != file_size:
    ftruncate(file, pos)
return Wal { fd, write_offset = pos }

append

if len == 0 or len > u32::MAX: return error     // len == 0 is reserved as the EOF sentinel
hdr[0..4] = len.to_le_bytes()
hdr[4..8] = crc32(payload).to_le_bytes()
pwrite(fd, hdr,     write_offset)
pwrite(fd, payload, write_offset + 8)
offset_returned = write_offset
write_offset += 8 + len
return offset_returned

We do not fsync inside append. Callers do that explicitly via sync() to enable group commit.

sync

  • Linux: fdatasync(fd)
  • macOS: fcntl(fd, F_FULLFSYNC, 0) for true device-level sync; fall back to fsync(fd) if F_FULLFSYNC fails (e.g., on a filesystem or device that does not support it).
  • Windows: FlushFileBuffers(handle) (out of scope here).

In this lab we use fdatasync (Linux) and fsync (macOS) for simplicity; production should consider F_FULLFSYNC on macOS because plain fsync does not guarantee device-level durability on Apple's filesystems.

iter — read-only replay

Mirrors open's scan loop but yields each payload instead of advancing a write cursor. Stops on the same conditions (len == 0, short header, short payload, bad CRC). Never panics on garbage.
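Putting open, append, sync, and iter together, a compact Unix-only Rust sketch might look like the following. It makes several simplifying assumptions that are not from the lab code: the whole file is read into memory on open, `iter` returns a `Vec` instead of a streaming iterator, header and payload go out in one `write_all_at` call, and the CRC is bitwise.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Read};
use std::os::unix::fs::FileExt;
use std::path::Path;

fn crc32_ieee(bytes: &[u8]) -> u32 {
    let mut c: u32 = 0xFFFF_FFFF;
    for &b in bytes {
        c ^= b as u32;
        for _ in 0..8 {
            c = if c & 1 != 0 { (c >> 1) ^ 0xEDB8_8320 } else { c >> 1 };
        }
    }
    c ^ 0xFFFF_FFFF
}

/// Returns the end offset of the valid prefix and the payloads inside it.
fn scan(b: &[u8]) -> (u64, Vec<Vec<u8>>) {
    let (mut pos, mut out) = (0usize, Vec::new());
    while b.len() - pos >= 8 {
        let len = u32::from_le_bytes([b[pos], b[pos + 1], b[pos + 2], b[pos + 3]]) as usize;
        let crc = u32::from_le_bytes([b[pos + 4], b[pos + 5], b[pos + 6], b[pos + 7]]);
        if len == 0 || len > b.len() - pos - 8 {
            break; // EOF sentinel or short payload
        }
        let payload = &b[pos + 8..pos + 8 + len];
        if crc32_ieee(payload) != crc {
            break; // corrupt record
        }
        out.push(payload.to_vec());
        pos += 8 + len;
    }
    (pos as u64, out)
}

pub struct Wal {
    file: File,
    write_offset: u64,
}

impl Wal {
    /// Opens or creates the log, then truncates any torn or corrupt tail.
    pub fn open(path: &Path) -> io::Result<Wal> {
        let mut file = OpenOptions::new().read(true).write(true).create(true).open(path)?;
        let mut bytes = Vec::new();
        file.read_to_end(&mut bytes)?; // fine for a lab-sized log
        let (end, _) = scan(&bytes);
        if end != bytes.len() as u64 {
            file.set_len(end)?;
        }
        Ok(Wal { file, write_offset: end })
    }

    /// Appends one record and returns its start offset. Does not sync.
    pub fn append(&mut self, payload: &[u8]) -> io::Result<u64> {
        if payload.is_empty() || payload.len() > u32::MAX as usize {
            return Err(io::Error::new(io::ErrorKind::InvalidInput, "bad payload length"));
        }
        let mut rec = Vec::with_capacity(8 + payload.len());
        rec.extend_from_slice(&(payload.len() as u32).to_le_bytes());
        rec.extend_from_slice(&crc32_ieee(payload).to_le_bytes());
        rec.extend_from_slice(payload);
        let start = self.write_offset;
        self.file.write_all_at(&rec, start)?; // positional write, no userspace buffer
        self.write_offset += rec.len() as u64;
        Ok(start)
    }

    pub fn sync(&self) -> io::Result<()> {
        self.file.sync_data()
    }

    pub fn len(&self) -> u64 {
        self.write_offset
    }

    /// Read-only replay of the valid prefix.
    pub fn iter(path: &Path) -> io::Result<Vec<Vec<u8>>> {
        Ok(scan(&std::fs::read(path)?).1)
    }
}
```

Because `open` and `iter` share `scan`, they stop on exactly the same conditions, which is the property the tests below rely on.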

Tests to pin behavior

| # | Test | Expected |
|---|---|---|
| T1 | Append "A", "B", reopen, iter → ["A", "B"] | Both records returned in order |
| T2 | Append, truncate WAL by 1 byte (cut payload), reopen, iter | Last record dropped |
| T3 | Append, flip a payload byte, iter | Reader stops at bad CRC |
| T4 | Append, write \0\0\0\0\0\0\0\0 past EOF, reopen | File length restored to pre-garbage size |
| T5 | append() returned offsets are strictly increasing and equal to file size before that append | Yes |

Gotchas

  • macOS fsync does not flush the disk write cache. Use F_FULLFSYNC for tests that must outlive a power loss.
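  • Rust File::write_all goes straight to write(2); there is no userspace buffer unless you wrap the file in a BufWriter, and File::flush does not make anything durable (only sync_data / sync_all do). We use positional writes via nix or std::os::unix::fs::FileExt::write_all_at (pwrite) and never wrap the file in a BufWriter.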
  • Go os.File.Write is unbuffered by default, but bufio.Writer is not. Make sure your Wal does not wrap the file in a bufio.Writer — that defers writes invisibly and confuses sync.

Step 3 — Group commit benchmark

Goal

Quantify the cost of fsync and the throughput win from group commit.

Workload

bench-group PATH N BATCH:

for i in 0..N:
    append(payload)
    if (i+1) % BATCH == 0: sync()
sync()   // final

PATH is a brand-new file each run. N = 50_000 is a good starting point on a modern SSD.
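The workload loop can be sketched in Rust as below. This is not the walbench source: `bench_group` and `BenchResult` are hypothetical names, the explicit `size` parameter is an addition for illustration, and the 8-byte header is a zeroed placeholder because only the write and sync pattern matters here.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;
use std::time::Instant;

pub struct BenchResult {
    pub syncs: u64,
    pub records_per_sec: f64,
}

/// Appends `n` records of `size` payload bytes, syncing once per `batch` records.
pub fn bench_group(path: &Path, n: u64, size: usize, batch: u64) -> io::Result<BenchResult> {
    assert!(batch > 0 && size > 0);
    // truncate(true) gives every run a brand-new, empty file.
    let mut file = OpenOptions::new().write(true).create(true).truncate(true).open(path)?;
    let header = [0u8; 8]; // placeholder for len + crc
    let payload = vec![0xABu8; size];
    let mut syncs = 0u64;
    let start = Instant::now();
    for i in 0..n {
        file.write_all(&header)?;
        file.write_all(&payload)?;
        if (i + 1) % batch == 0 {
            file.sync_data()?;
            syncs += 1;
        }
    }
    file.sync_data()?; // final sync covers a trailing partial batch
    syncs += 1;
    let secs = start.elapsed().as_secs_f64();
    Ok(BenchResult { syncs, records_per_sec: n as f64 / secs })
}
```

Comparing `records_per_sec` across batch sizes on the same device is the point of the exercise; the absolute numbers depend heavily on the filesystem and on what the platform's sync call actually flushes.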

Numbers to look for (M2 Pro, APFS, 64-byte payload)

| Batch | Throughput | Avg latency / sync | Bytes flushed / sync |
|---|---|---|---|
| 1 | ~1,800 rec/s | ~560 µs | ~72 B |
| 8 | ~12,000 rec/s | ~670 µs | ~576 B |
| 64 | ~110,000 rec/s | ~580 µs | ~4.6 KB |
| 512 | ~260,000 rec/s | ~1.0 ms | ~37 KB |
| 4096 | ~310,000 rec/s | ~13 ms | ~295 KB |

Two effects worth noting:

  • Sync time is roughly constant up to ~4KB: the bottleneck is the per-fsync overhead (syscall + journal commit), not the byte count.
  • Returns diminish past batch ~256: bandwidth becomes the next limit. Past ~4096 you start hitting tail-latency cliffs.

What "broken" looks like

  • Per-record throughput is the same as group=64: your sync() isn't doing anything (no-op, wrong fd, or bufio.Writer swallowing the write).
  • Throughput keeps climbing past group=4096: you may not be calling sync() at all between batches.
  • macOS numbers look impossibly fast: plain fsync does not flush the device cache. Re-run with F_FULLFSYNC to compare.

Comparing to a Linux box

On NVMe + ext4:

| Batch | Throughput |
|---|---|
| 1 | ~3,000 rec/s |
| 64 | ~180,000 rec/s |
| 4096 | ~600,000 rec/s |

The shape is identical; absolute numbers depend on the device's flush latency.

Bloom Filters and Hashing

Status: complete — runnable in Rust, Go, C++.

1. What Is It

A Bloom filter is a probabilistic set: add(x) always succeeds; contains(x) returns either definitely not present or probably present. It uses a fixed-size bit array m and k independent hash functions; add(x) sets bits at positions h_1(x) mod m, …, h_k(x) mod m; contains(x) returns true iff all those bits are set.

add("foo"):
   h1=37 h2=812 h3=4    →  bits[37]=bits[812]=bits[4]=1

contains("bar"):
   h1=99 h2=812 h3=120  →  bits[99]=0  ⇒  definitely absent
contains("foo"):
   h1=37 h2=812 h3=4    →  all 1  ⇒  probably present

False positives are inherent (any other key that hits the same k bits looks present); false negatives are impossible (a stored key set its bits, and we never unset).

2. Why It Matters

| Without a bloom filter | With one |
|---|---|
| LevelDB / RocksDB Get(k) on a miss probes every SSTable's index → many disk reads | One in-memory bit-test per SSTable rejects 99% of misses |
| Distributed cache: "do I have this key?" requires a network RTT | Local bit-test on a 1 MB filter answers in nanoseconds |
| Spell-checker holds full dictionary | Few bits per word |
| Webcrawler revisits the same URL | A few bits per URL prevent recrawl |

Filter sizes are tiny: at the textbook optimum (~9.6 bits/key for 1% FPR) a million keys fit in 1.2 MB. Cache-resident.

3. How It Works

For n inserts into m bits with k hashes (assuming independent uniform hashes), the probability a given bit is still zero is (1 - 1/m)^(kn) ≈ e^(-kn/m), so the false-positive rate is

$$\text{FPR} \approx \left(1 - e^{-kn/m}\right)^k$$

Differentiating with respect to k yields the optimal hash count

$$k^* = \frac{m}{n}\ln 2 \approx 0.693 \cdot \frac{m}{n}$$

and the achievable FPR at $k^*$:

$$\text{FPR}^* \approx 0.6185^{m/n}$$

So 10 bits/key ⇒ ~1% FPR with 7 hashes; 20 bits/key ⇒ ~0.01% with 14 hashes.
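These figures can be reproduced with a few lines of Rust. The helper names are hypothetical; `bits_per_key_for` simply inverts the optimum above, using m/n = -ln(p) / (ln 2)², which follows from substituting k* into the FPR formula.

```rust
use std::f64::consts::LN_2;

/// Optimal hash count k* = (m/n) ln 2, rounded to the nearest integer (at least 1).
pub fn optimal_k(bits_per_key: f64) -> u32 {
    (bits_per_key * LN_2).round().max(1.0) as u32
}

/// Approximate false-positive rate (1 - e^(-k n / m))^k for a given m/n and k.
pub fn fpr(bits_per_key: f64, k: u32) -> f64 {
    (1.0 - (-(k as f64) / bits_per_key).exp()).powi(k as i32)
}

/// Bits per key needed to reach `target` FPR when k is chosen optimally.
pub fn bits_per_key_for(target: f64) -> f64 {
    -target.ln() / (LN_2 * LN_2)
}
```

For example, `optimal_k(10.0)` evaluates to 7 and `fpr(10.0, 7)` to roughly 0.008, in line with the "~1% with 7 hashes" figure, while `bits_per_key_for(0.01)` comes out near 9.6.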

Kirsch–Mitzenmacher double hashing

We do not compute k independent hashes. Per Kirsch & Mitzenmacher (2006), it is sufficient — with no measurable FPR penalty — to compute one 64-bit hash, split it into halves h1 and h2, and synthesize:

g_i(x) = (h1(x) + i * h2(x)) mod m   for i = 0..k-1

This is what LevelDB, RocksDB, and most production filters do.

In this lab the underlying 64-bit hash is FNV-1a64 of the key, then mixed once through SplitMix64 to spread the bits. (FNV-1a64 alone is biased in its high bits, and the Kirsch–Mitzenmacher splitting cares about both halves being well-distributed.)
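A minimal Rust sketch of a filter built this way. The `Bloom` type and `positions` helper are hypothetical, and two details are assumptions of this sketch and not taken from the lab: h1 is the high 32 bits and h2 the low 32 bits of the hash, and h2 is forced odd so that it can never be zero.

```rust
/// FNV-1a 64 followed by the SplitMix64 finalizer.
fn hash64(key: &[u8]) -> u64 {
    let mut h = 0xcbf2_9ce4_8422_2325u64;
    for &b in key {
        h = (h ^ b as u64).wrapping_mul(0x0000_0100_0000_01b3);
    }
    h = (h ^ (h >> 30)).wrapping_mul(0xbf58_476d_1ce4_e5b9);
    h = (h ^ (h >> 27)).wrapping_mul(0x94d0_49bb_1331_11eb);
    h ^ (h >> 31)
}

/// Kirsch-Mitzenmacher probe positions g_i = (h1 + i*h2) mod m for i in 0..k.
fn positions(key: &[u8], k: u32, m: u64) -> impl Iterator<Item = u64> {
    let h = hash64(key);
    let h1 = h >> 32;
    let h2 = (h & 0xFFFF_FFFF) | 1; // assumption: force h2 odd so it is never zero
    (0..k as u64).map(move |i| h1.wrapping_add(i.wrapping_mul(h2)) % m)
}

pub struct Bloom {
    pub k: u32,
    pub m: u64,
    pub bits: Vec<u8>, // bit p lives in byte p / 8, at bit p % 8
}

impl Bloom {
    pub fn new(m: u64, k: u32) -> Self {
        assert!(m > 0 && k > 0);
        Bloom { k, m, bits: vec![0u8; ((m + 7) / 8) as usize] }
    }

    pub fn add(&mut self, key: &[u8]) {
        for p in positions(key, self.k, self.m) {
            self.bits[(p / 8) as usize] |= 1u8 << (p % 8);
        }
    }

    /// false means definitely absent; true means probably present.
    pub fn contains(&self, key: &[u8]) -> bool {
        positions(key, self.k, self.m)
            .all(|p| (self.bits[(p / 8) as usize] & (1u8 << (p % 8))) != 0)
    }
}
```

With m = 100,000 bits, k = 7, and 10,000 inserted keys (10 bits/key), the observed false-positive rate on absent keys should land near the predicted ~1%.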

On-disk / on-wire layout

 ┌─────────┬─────────┬───────────────────────────┐
 │ k (u32) │ m (u64) │  bits  (⌈m/8⌉ bytes, LE)  │
 └─────────┴─────────┴───────────────────────────┘

Identical layout across Rust/Go/C++ so all three can read each other's filters byte-for-byte.
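A sketch of the encode and decode steps for that layout in Rust. `encode_filter` and `decode_filter` are hypothetical names, and little-endian encoding of the `k` and `m` fields is an assumption carried over from the WAL header convention; the diagram only states LE for the bit array.

```rust
/// Number of bytes needed to hold `m` bits, computed without overflow.
fn bytes_for_bits(m: u64) -> u64 {
    m / 8 + (m % 8 != 0) as u64
}

/// Serializes a filter as k (u32 LE) || m (u64 LE) || bits.
pub fn encode_filter(k: u32, m: u64, bits: &[u8]) -> Vec<u8> {
    assert_eq!(bits.len() as u64, bytes_for_bits(m));
    let mut out = Vec::with_capacity(12 + bits.len());
    out.extend_from_slice(&k.to_le_bytes());
    out.extend_from_slice(&m.to_le_bytes());
    out.extend_from_slice(bits);
    out
}

/// Parses the layout back; returns None when the buffer is malformed.
pub fn decode_filter(buf: &[u8]) -> Option<(u32, u64, &[u8])> {
    if buf.len() < 12 {
        return None; // header is incomplete
    }
    let k = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]);
    let mut m_bytes = [0u8; 8];
    m_bytes.copy_from_slice(&buf[4..12]);
    let m = u64::from_le_bytes(m_bytes);
    let bits = &buf[12..];
    if k == 0 || m == 0 || bits.len() as u64 != bytes_for_bits(m) {
        return None; // length does not match the declared bit count
    }
    Some((k, m, bits))
}
```

Validating the bit-array length against `m` on decode is what lets a reader in another language reject a truncated filter instead of indexing past the end.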

4. Core Terminology

| Term | Definition |
|---|---|
| m | Bit-array size in bits. |
| n | Number of distinct keys inserted. |
| k | Number of hash functions per key. |
| FPR | False-positive rate at query time. False negatives are impossible. |
| Bits/key | m / n. The single knob that determines achievable FPR. |
| Saturation | Once a large fraction of bits are 1, FPR climbs sharply; filters should be sized for the maximum expected n. |
| Counting Bloom | Variant that supports remove by storing 4-bit counters per slot. Costs 4× memory. |
| Cuckoo filter | Modern alternative: supports delete, often lower space at FPR ≤ 1%, harder to size. |
| Xor filter | Static (build once, query many); best space efficiency, but no incremental inserts. |

5. Mental Models

  1. Bloom is a hash-collision amplifier. One hash collision is rare; needing k of them simultaneously is rarer. The filter trades memory for that compounding.
  2. A Bloom filter is a negative index. Use it to avoid work; never use it to prove presence.
  3. Hash quality matters less than independence. Once individual bits are well-distributed, the Kirsch–Mitzenmacher trick gives you arbitrarily many "independent" hashes for free.
  4. You can compose them. Union ⇒ bitwise OR (with same m, k). Approximate intersection ⇒ bitwise AND (overestimates).
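The composition rule in the last item can be sketched directly on the raw bit arrays. This is an illustrative sketch under the assumption that both filters were built with the same m and k; `union_bits` and `intersect_bits` are hypothetical names, not part of the lab's library.

```rust
/// Bitwise OR of two bit arrays built with the same m and k.
/// The result answers `contains` exactly like a filter built from both key sets.
/// Returns None when the lengths differ, because the filters are incompatible.
pub fn union_bits(a: &[u8], b: &[u8]) -> Option<Vec<u8>> {
    if a.len() != b.len() {
        return None;
    }
    Some(a.iter().zip(b).map(|(x, y)| x | y).collect())
}

/// Bitwise AND of two bit arrays built with the same m and k.
/// This only approximates the intersection: it keeps every bit that both
/// filters happen to share, so it has more bits set (and a higher FPR)
/// than a filter built from the true intersection of the key sets.
pub fn intersect_bits(a: &[u8], b: &[u8]) -> Option<Vec<u8>> {
    if a.len() != b.len() {
        return None;
    }
    Some(a.iter().zip(b).map(|(x, y)| x & y).collect())
}
```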

6. Common Misconceptions

  • "FPR depends only on the number of bits set." It depends on m, n, and k. Two filters with the same fill factor but different k have different FPR (a query is a false positive only when all k probed bits are set).
  • "Bigger k is always better." Past $k^*$, FPR climbs again because each insert sets more bits, accelerating saturation.
  • "I can resize a Bloom filter." No — bit positions depend on m. Resize by building a fresh filter from the underlying data (or by maintaining a scalable filter, which is a series of growing Bloom filters).
  • "Cryptographic hashes are required." Wasted CPU. Anything well-distributed and fast (FNV, xxhash, MurmurHash3, CityHash) works.
  • "remove would be cheap if I just cleared the bits." It would also clear bits set by every other key that shares positions. Counting Bloom exists for this reason.

7. Interview Talking Points

  • Derive $k^* = (m/n) \ln 2$ and the resulting FPR formula from first principles.
  • Explain Kirsch–Mitzenmacher and why it doesn't increase FPR (citation: Less Hashing, Same Performance, ESA 2006).
  • Walk through how RocksDB pairs a Bloom filter with each SSTable — and how the new ribbon filter improves on that.
  • Quantify: "for 1% FPR you need ~10 bits/key; for one-in-a-million, ~30."
  • Contrast Bloom vs. Cuckoo vs. xor filters and their tradeoffs.

8. Connections to Other Labs

  • db-06 — every SSTable carries an embedded Bloom filter.
  • db-07 — compaction rebuilds filters because input filters can't be merged exactly.
  • db-08 — filter block is cached separately from data blocks.
  • db-09 — LookupKey flow consults the per-table filter before reading the index block.
  • db-21 — prefix Bloom filters, partitioned filters, ribbon filters.

References — Bloom Filters and Hashing

Foundational papers

  • Burton H. Bloom, Space/Time Trade-offs in Hash Coding with Allowable Errors, CACM 1970. The original 2-page paper. https://dl.acm.org/doi/10.1145/362686.362692
  • Adam Kirsch & Michael Mitzenmacher, Less Hashing, Same Performance: Building a Better Bloom Filter, ESA 2006. https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf
  • Bin Fan, David G. Andersen, Michael Kaminsky, Michael D. Mitzenmacher, Cuckoo Filter: Practically Better Than Bloom, CoNEXT 2014. https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf
  • Thomas M. Graf & Daniel Lemire, Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters, JEA 2020. https://arxiv.org/abs/1912.08258
  • Peter C. Dillinger & Stefan Walzer, Ribbon Filter: Practically Smaller Than Bloom and Xor, 2021. https://arxiv.org/abs/2103.02515

Production code to read

  • LevelDB filter policy: https://github.com/google/leveldb/blob/main/util/bloom.cc
  • RocksDB filter blocks: https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter
  • RocksDB ribbon filter implementation: https://github.com/facebook/rocksdb/blob/main/util/ribbon_impl.h

Survey & blog posts

  • Daniel Lemire, "All about Bloom filters" series: https://lemire.me/blog/tag/bloom-filter/
  • Jeff Dean's classic numbers-every-programmer-should-know — useful when sizing filters against disk-seek and RAM costs.
  • Hadron, "How RocksDB sizes filters": https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide

Hash functions

  • Fowler–Noll–Vo (FNV) reference: http://www.isthe.com/chongo/tech/comp/fnv/
  • SplitMix64 (mixer from Steele, Lea & Flood; C reference by Sebastiano Vigna): https://prng.di.unimi.it/splitmix64.c
  • Austin Appleby, MurmurHash3: https://github.com/aappleby/smhasher/wiki/MurmurHash3
  • xxHash by Yann Collet: https://github.com/Cyan4973/xxHash

Analysis — Bloom Filters and Hashing

Problem statement

Build a fixed-memory probabilistic set that:

  1. Never reports false negatives (lookups for present keys always return true).
  2. Reports false positives at a tunable, well-understood rate.
  3. Is fast enough to consult in the hot path of a key-value store lookup (≈ nanoseconds).
  4. Has an on-disk representation identical across Rust, Go, and C++ so the same filter built in any language can be read by any other.

Constraints

| Constraint | Why it matters |
|---|---|
| Compact (≤ 2 bytes/key for 5% FPR) | The filter is loaded into RAM beside the table it indexes. |
| Constant-time add and contains | Hot path of Get(k). |
| Deterministic across languages | Cross-language tests must pass. |
| Single 64-bit hash | We synthesize k indices via Kirsch–Mitzenmacher, which keeps CPU low. |
| No remove | Pure Bloom. Counting variants left to db-21. |

Design decisions

  1. Base hash = FNV-1a64 then SplitMix64 mixing. FNV-1a64 is trivial to implement identically across languages; SplitMix64 finalizing fixes its weak avalanche so the upper and lower 32 bits are both well-distributed. The two 32-bit halves become h1 and h2 for double hashing.
  2. Kirsch–Mitzenmacher: g_i = h1 + i*h2, all u64 arithmetic, reduced with a plain % m. Lemire's u128 multiply-shift reduction (Daniel Lemire, Fast Random Integer Generation in an Interval, 2018) would avoid the div, but it yields different indices and needs inputs spread over the full 64-bit range, so it is left as an optimization that all three languages would have to adopt together.
  3. Bit array is little-endian byte-packed: bit i is (bytes[i/8] >> (i%8)) & 1. Same convention LevelDB and RocksDB use.
  4. Header = k(u32 LE) || m(u64 LE). 12 bytes total. We deliberately put k first so a partial-read of just the header reveals the hash count without needing m.
  5. No checksum on the filter itself. A bit flipped from 0 to 1 only adds false positives, but a bit flipped from 1 to 0 can produce false negatives, so the filter relies on the page-level checksumming that belongs to db-06 (SSTable) rather than carrying its own.
  6. with_fpr(n, fpr) constructor. Picks m = ceil(-n * ln(fpr) / (ln 2)^2) and k = round((m/n) * ln 2), clamped to [1, 30] to avoid degenerate sizing for absurdly small FPRs.

Why this design over alternatives

  • vs MurmurHash3 / xxhash: faster and arguably higher quality, but each is hundreds of lines to re-implement identically in three languages. FNV+SplitMix is 12 lines per language and indistinguishable in our use case.
  • vs k independent hashes: roughly k× the hashing CPU for no measurable FPR change (Kirsch & Mitzenmacher 2006).
  • vs Cuckoo / xor filters: more space-efficient at low FPR but much more code. Worth a separate lab — db-21.
  • vs in-language hashers (Rust's default std::hash hasher, Go's hash/maphash, C++'s std::hash<string>): Rust's default hasher algorithm is unspecified and may change between releases; Go's maphash is randomly seeded per process; C++ std::hash<string> is implementation-defined. None of them survive cross-language interop.

Failure modes addressed

| Failure | How |
|---|---|
| FPR much higher than claimed | Test V4 (fpr_within_2x): empirically measure FPR with 100k random absent queries and assert it's within 2× of the theoretical bound. |
| Bit packing mismatched across languages | Cross-language test (scripts/cross_test.sh): each writer dumps its filter bytes; each reader queries it for known-present & known-absent keys. |
| Endian mismatch in header | All header fields encoded little-endian explicitly. |
| Hash function mismatch | Test V1 (fnv1a64_known_vectors): known FNV-1a64 vectors ("" → 0xcbf29ce484222325, "foobar" → 0x85944171f73967e8) checked at startup. |
| Saturation at n ≫ planned | contains still works; FPR climbs gracefully. Filter constructors document the assumed n. |

Failure modes NOT addressed

  • Concurrent inserts. Single writer model. Concurrent add corrupts overlapping byte writes. Lock externally or use atomic OR per byte — covered in db-08 / db-21.
  • Adversarial keys. FNV-1a64 is not cryptographic — an attacker can craft collisions to inflate FPR. Use SipHash / xxh3 (with secret seed) if filter inputs are attacker-controlled.
  • Deletion. Use a counting Bloom or a cuckoo filter. See db-21.
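The "atomic OR per byte" option mentioned for concurrent inserts can be sketched as follows. This is an assumption-laden illustration, not the lab's implementation: it stores the bit array as a slice of `AtomicU8`, and the helper names `set_bit_atomic` and `get_bit_atomic` are hypothetical.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Set bit `i` (LSB-first within each byte) with an atomic OR, so two
/// writers that touch the same byte cannot overwrite each other's bits.
/// Relaxed ordering is enough for the OR itself; making completed inserts
/// visible to readers on other threads needs separate synchronization.
pub fn set_bit_atomic(bits: &[AtomicU8], i: u64) {
    let mask = 1u8 << (i % 8);
    bits[(i / 8) as usize].fetch_or(mask, Ordering::Relaxed);
}

/// Test bit `i` using the same LSB-first convention.
pub fn get_bit_atomic(bits: &[AtomicU8], i: u64) -> bool {
    bits[(i / 8) as usize].load(Ordering::Relaxed) & (1u8 << (i % 8)) != 0
}
```

Because bits are only ever set and never cleared, a racing reader can at worst miss an insert that has not finished yet; it can never observe a bit that was not set by some `add`.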

Execution — How to Build and Run

Quick start (per language)

# Rust
cd src/rust
cargo build --release
cargo test --release
./target/release/bloombench --help

# Go
cd src/go
go test ./...
go build -o bin/bloombench ./cmd/bloombench
./bin/bloombench --help

# C++
cd src/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
ctest --test-dir build
./build/bloombench --help

CLI: bloombench

A single binary per language. Subcommands have the same shape across all three so the cross-language tests can shell out to any of them interchangeably.

| Subcommand | Args | Behaviour |
|---|---|---|
| hash STR | one string | Print fnv1a64=… splitmix=… h1=… h2=… for the given input |
| build PATH N FPR | output path, key count, target FPR | Insert keys key0..key{N-1} and write the filter to PATH |
| query PATH KEY | filter path, key | Print present or absent for one key |
| query-file PATH KEYS_FILE | filter path, file with one key per line | Print results for each |
| fpr-test PATH N M | filter path (built with N keys), M random absent keys | Measure FPR and print observed vs theoretical |

Library API

fnv1a64(bytes) -> u64
splitmix64(u64) -> u64
mix64(bytes) -> u64                   // = splitmix64(fnv1a64(bytes))

BloomFilter::new(m_bits, k) -> Bloom
BloomFilter::with_fpr(n, fpr) -> Bloom
  - picks m, k optimally; caps k at 30
Bloom::add(bytes)
Bloom::contains(bytes) -> bool
Bloom::k(), Bloom::m_bits(), Bloom::m_bytes()
Bloom::encode() -> Vec<u8>            // header || bits
Bloom::decode(bytes) -> Bloom         // inverse, validates header

On-disk / on-wire layout

 ┌─────────┬─────────┬─────────────────────────┐
 │ k (u32) │ m (u64) │  bits  ⌈m/8⌉ bytes      │
 └─────────┴─────────┴─────────────────────────┘
      4         8              ⌈m/8⌉

 bit i = (bytes[i/8] >> (i mod 8)) & 1

All integers little-endian. No padding. No internal checksum.

Verifying

./scripts/verify.sh        # per-language unit + property tests
./scripts/cross_test.sh    # writer/reader cross-product over {go,rust,cpp}

Observation — Looking at the Bits

1. Hexdump a freshly built filter

./build/bloombench build /tmp/bf 4 0.05
xxd /tmp/bf
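00000000: 0400 0000 1900 0000 0000 0000 xxxx xxxx  ................

Reading the header: with the sizing formulas, 4 keys at 5% FPR give k=4 and m=25 bits (0x19) ⇒ ⌈25/8⌉ = 4 bytes of bits, 16 bytes in total. The trailing xxxx xxxx stands for the bit vector with 4 keys mixed in; its exact value depends on the hash chain. Because m is not a multiple of 8, the top 7 bits of the last byte are padding and stay zero.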

For 1000 keys at 1% FPR you should see roughly 9.6 bits/key ⇒ 1200 bytes of bits, and k ≈ 7.

2. Sanity-check the hash chain

./build/bloombench hash foobar
# fnv1a64=85944171f73967e8  splitmix=...  h1=...  h2=...

Known FNV-1a64 vectors (used in tests):

| Input | fnv1a64 |
|---|---|
| "" | 0xcbf29ce484222325 |
| "a" | 0xaf63dc4c8601ec8c |
| "foobar" | 0x85944171f73967e8 |

All three languages must print the same fnv1a64 and same splitmix64 for any given input. If they don't, cross-language interop is dead on arrival.

3. Empirical FPR matches the formula

./build/bloombench build /tmp/bf 100000 0.01
./build/bloombench fpr-test /tmp/bf 100000 1000000
# expected: observed=0.0098  theoretical≈0.0100   (within ±20% with 1M samples)

If observed FPR is much higher than theoretical:

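  • k is wrong (probably 0 or 1 due to integer truncation; check with_fpr math).
  • Hash is biased (FNV without SplitMix mixing has weak avalanche, so h1 and h2 are poorly distributed).
  • mod m step has a bias (using h % m with non-prime m is OK; using h & (m-1) only works when m is a power of two).

If observed FPR is much lower: the filter probably holds fewer distinct keys than the n fed to the formula (duplicate keys, or add silently skipping inserts). Note that a "random absent" key generator overlapping the present set has the opposite effect and inflates the observed rate, so verify the input either way.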

4. Bit density vs FPR

for fpr in 0.10 0.05 0.01 0.001 0.0001; do
  ./build/bloombench build /tmp/bf 10000 $fpr
  ls -l /tmp/bf
done

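Sample file sizes (12-byte header + body), as given by the sizing formula for n = 10 000:

| FPR | Bytes | Bits/key |
|---|---|---|
| 0.10 | ~6 000 | ~4.8 |
| 0.05 | ~7 810 | ~6.2 |
| 0.01 | ~11 990 | ~9.6 |
| 0.001 | ~17 980 | ~14.4 |
| 0.0001 | ~23 980 | ~19.2 |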

The 9.6 bits/key heuristic for 1% FPR is the one most often quoted in interviews.
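The file sizes in this experiment follow from the sizing formula plus the 12-byte header. As a rough sketch (the helper name `encoded_size` is hypothetical and not part of the lab's library), the expected `ls -l` value can be computed like this:

```rust
/// Expected encoded filter size in bytes for `n` keys at target rate `fpr`:
/// a 12-byte header plus ceil(m/8) bytes of bits,
/// with m = ceil(-n * ln(fpr) / (ln 2)^2).
/// (Hypothetical helper for checking the experiment by hand.)
pub fn encoded_size(n: u64, fpr: f64) -> u64 {
    let ln2 = std::f64::consts::LN_2;
    let m = (-(n as f64) * fpr.ln() / (ln2 * ln2)).ceil() as u64;
    12 + (m + 7) / 8
}
```

For example, `encoded_size(10_000, 0.01)` evaluates to 11 994 bytes, i.e. about 9.6 bits/key once the header is excluded.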

5. Cross-language byte-identical filters

for lang in go rust cpp; do
  ./src/$lang/.../bloombench build /tmp/bf.$lang 1000 0.01
done
md5sum /tmp/bf.*    # all three identical

If any digest differs, suspect (in order): endian, bit ordering inside the byte, integer types of the header, or hash mismatch.

What "working" looks like

  • Bytes 0..3 = k, bytes 4..11 = m, bytes 12..end = bits. No padding.
  • Empirical FPR is within 2× of theoretical for any (n, fpr) you try.
  • All three languages produce identical filters and read each other's filters.

What "broken" looks like

  • contains(k) returns false for a key you just inserted ⇒ false negative ⇒ critical bug. Likely indexing math: set and get disagree about bit-within-byte.
  • FPR is 100% ⇒ all bits are 1 ⇒ either m was rounded down to 0 or you're indexing past the bit array.
  • FPR is 0% with realistic load ⇒ add is a no-op or contains always returns true on the "absent" path.
  • Cross-language readers disagree ⇒ print the first 16 bytes of each filter and the first three h1, h2 values for a known key; one of them is wrong.
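Several of the symptoms above (all bits 1, or add being a no-op) show up immediately in the fill fraction of the bit array. The sketch below is one way to check it; `fill_fraction` and `fpr_from_fill` are hypothetical helper names, and the code assumes padding bits beyond m are zero.

```rust
/// Fraction of set bits among the `m` bits of the array.
/// Assumes any padding bits in the last byte are zero.
pub fn fill_fraction(bits: &[u8], m: u64) -> f64 {
    let ones: u64 = bits.iter().map(|b| b.count_ones() as u64).sum();
    ones as f64 / m as f64
}

/// FPR predicted from the observed fill: a query for an absent key is a
/// false positive only if all k probed bits are set, so FPR ≈ fill^k.
pub fn fpr_from_fill(fill: f64, k: u32) -> f64 {
    fill.powi(k as i32)
}
```

A filter sized at the optimal k sits near a fill of 0.5; a fill close to 1.0 points at saturation or a too-small m, and a fill of 0.0 after inserts means add is not touching the array.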

Verification — What to Test and How

Per-language property tests

| # | Test | Pass if |
|---|---|---|
| V1 | fnv1a64_known_vectors | "" → 0xcbf29ce484222325; "a" → 0xaf63dc4c8601ec8c; "foobar" → 0x85944171f73967e8 |
| V2 | splitmix64_known_vectors | splitmix64(0) = 0xe220a8397b1dcdaf; splitmix64(0xdeadbeef) = 0x4adfb90f68c9eb9b |
| V3 | no_false_negatives | Insert N=10 000 random keys (seeded); contains returns true for every one |
| V4 | fpr_within_2x | Build for n=10 000 at fpr=0.01; query 100 000 random absent keys; observed FPR ≤ 2× theoretical |
| V5 | optimal_k_formula | with_fpr(1000, 0.01) returns k=7 and 9 580 ≤ m ≤ 9 620 (allow ±0.5%) |
| V6 | encode_decode_roundtrip | encode → decode → query the same keys: identical results |
| V7 | header_layout | First 4 bytes = k LE; next 8 = m LE; payload length = ⌈m/8⌉ |
| V8 | empty_filter_rejects_all | New filter with m=64, k=3; contains returns false for 1000 random keys |

Cross-language test

scripts/cross_test.sh performs the writer × reader matrix for {go, rust, cpp}²:

  1. Each writer builds a filter for the same fixed-seed key set (1 000 keys).
  2. Filters must be byte-identical (md5sum over filter file).
  3. Each reader opens each writer's filter and runs:
    • 1 000 known-present queries → must all return present
    • 1 000 known-absent queries (different seed) → results must match across readers

This catches:

  • Endian or bit-order bugs in the header / bit array.
  • Hash mismatch (fnv1a64 or splitmix64 differs).
  • mod m reduction differs (Lemire's u128 multiply-shift and % map the same hash to different indices, so every implementation must use the same one).

What "passing" means

  • All 8 property tests green in all three languages.
  • cross_test.sh exits 0 with 3 byte-identical filters (one per writer) and 9 passing writer × reader runs.
  • Manual smoke: hexdump of a 4-key filter matches the structure described in docs/observation.md.

Broader Ideas — Beyond the Minimum

Block / partitioned filters

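RocksDB's cache-local (blocked) Bloom implementation confines all k probes for a key to a single cache line. Trade: marginally higher FPR for ~3× faster contains on cache-cold filters. Partitioned filters are a separate RocksDB feature that splits a large filter block into independently cacheable pieces. See Optimizing Bloom Filter: Challenges, Solutions, and Comparisons (Luo et al., IEEE 2019).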

Cuckoo filters

Replace bit array with a hash table of fingerprints. Supports remove at a bits/key cost comparable to Bloom around 1% FPR (Fan et al. give roughly (log2(1/ε) + 3)/0.955 bits/key, about 10 at 1%), and pulls ahead as the target FPR gets lower. Slower to build, occasionally fails to insert when over-full. Excellent for membership tests with a known max size and a need for deletion.

Xor filters

Build-once, query-many. About 1.23 × log2(1/ε) bits/key, which is smaller than Bloom and Cuckoo at the same FPR, with fast lookups (always exactly 3 memory accesses). Bad fit if you insert incrementally; great for static datasets like compiled SSTable filters.

Ribbon filters (RocksDB 6.15+)

A linear-algebra reformulation of xor filters. ~30% smaller than Bloom at the same FPR, slightly slower lookup, and several times slower to build. RocksDB offers them as an opt-in filter policy (NewRibbonFilterPolicy); Bloom remains the default.

Prefix Bloom filters

When most queries are by prefix (e.g. userid:12345:*), build the filter from prefixes instead of full keys. Saves space and lets prefix-range queries use the filter.

Scaling without resizing

A scalable Bloom filter (Almeida et al., 2007) chains a sequence of progressively larger filters with progressively tighter FPRs. add writes to the youngest; contains ORs across all. The number of filters grows logarithmically with n, while total memory stays roughly proportional to n.

Compressed Bloom filters

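If you transmit a Bloom filter over a network, you can compress it, but a filter tuned to $k^*$ has about half its bits set and barely compresses. Mitzenmacher (2002) showed that optimizing for compressed size instead favors a larger, sparser bit array with a smaller k than optimizing for in-memory FPR.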

Cardinality estimation: HyperLogLog vs Bloom

Bloom tells you "in set" with FPR; HLL gives you |set| ± ~2% with constant memory. Often used in the same systems for different questions.

Filters in distributed systems

  • Bigtable / HBase: block-level filters per SSTable.
  • Cassandra: row-level filter per SSTable, plus a key cache.
  • Akamai / CDNs: "did this URL get cached?" Bloom-based pre-checks.
  • Gmail: per-user spam fingerprint filters.
  • Bitcoin SPV clients (BIP 37): filters published to full nodes to indicate which addresses the SPV wallet cares about. Famously broken from a privacy standpoint — the filter leaks the address set.

Adversarial considerations

Bloom-filter parameters and hashes are usually public. If users can choose keys, they can craft collisions that fill the filter and push FPR to 100%. Defenses:

  1. Use a keyed hash (SipHash) seeded at filter creation.
  2. Cap inserts or fall back to an exact structure beyond a threshold.
  3. Periodically rebuild.

Information-theoretic lower bound

Carter et al. (1978) prove that any approximate set with FPR ε requires at least n * log2(1/ε) bits, which is ~6.64 bits/key for 1% FPR. Bloom uses ~9.6 (44% overhead). Xor filters need about 1.23 × the bound (~23% overhead). Ribbon filters can be tuned to within a few percent of it.

Step 01 — Hash chain (FNV-1a64 → SplitMix64 → double hashing)

Before any filter logic, get the hash chain identical across all three languages. If fnv1a64("foobar") doesn't return 0x85944171f73967e8 everywhere, nothing else will work.

1. FNV-1a64

Algorithm:

hash = 0xcbf29ce484222325
for byte in input:
    hash ^= byte
    hash = hash * 0x100000001b3        // wrapping 64-bit multiplication
return hash

Known test vectors:

| Input | Result |
|---|---|
| "" (empty) | 0xcbf29ce484222325 (the initial value) |
| "a" | 0xaf63dc4c8601ec8c |
| "foobar" | 0x85944171f73967e8 |

Side-by-side:

Rust:

pub fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325; // FNV-1a 64-bit offset basis
    for &b in bytes {
        h ^= b as u64;                     // XOR first (the "1a" order)
        h = h.wrapping_mul(0x100000001b3); // then multiply by the 64-bit FNV prime
    }
    h
}

Go:

func FNV1a64(b []byte) uint64 {
    var h uint64 = 0xcbf29ce484222325
    for _, x := range b {
        h ^= uint64(x)
        h *= 0x100000001b3
    }
    return h
}

C++:

std::uint64_t Fnv1a64(const std::uint8_t* p, std::size_t n) {
    std::uint64_t h = 0xcbf29ce484222325ULL;
    for (std::size_t i = 0; i < n; ++i) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

⚠️ Two traps: don't use FNV-1 (not 1a; it multiplies before XORing, which gives different hashes); don't use the 32-bit prime or offset basis (different constants).

2. SplitMix64 finalizer

FNV-1a64 on its own has weak avalanche: the final input bytes influence only part of the output word. The SplitMix64 finalizer (Steele, Lea & Flood; C reference by Sebastiano Vigna) is a single-step bit mixer with near-perfect avalanche on a 64-bit input. We apply it to FNV's output so that both the upper and lower 32-bit halves are usable as independent hashes.

splitmix64(x):
    x = x + 0x9e3779b97f4a7c15
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9
    x = (x ^ (x >> 27)) * 0x94d049bb133111eb
    return  x ^ (x >> 31)

Known vectors:

| Input | Output |
|---|---|
| 0 | 0xe220a8397b1dcdaf |
| 0xdeadbeef | 0x4adfb90f68c9eb9b |
| splitmix64(fnv1a64("foobar")) | use the test to lock this in |

The combined hash we actually use:

mix64(bytes) := splitmix64(fnv1a64(bytes))
h1 = mix64(bytes) & 0xffffffff               // low 32 bits
h2 = mix64(bytes) >> 32                       // high 32 bits
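The pseudocode above translates almost line for line into Rust. The following is a minimal sketch of the whole chain; `fnv1a64`, `splitmix64`, and `mix64` follow the names in the Library API, while `halves` is a hypothetical helper added here for illustration.

```rust
/// FNV-1a, 64-bit: XOR each byte in, then multiply by the FNV prime.
pub fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// SplitMix64 step used as a finalizer; all arithmetic wraps modulo 2^64.
pub fn splitmix64(x: u64) -> u64 {
    let mut x = x.wrapping_add(0x9e3779b97f4a7c15);
    x = (x ^ (x >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94d049bb133111eb);
    x ^ (x >> 31)
}

/// Combined hash: FNV-1a64 followed by one SplitMix64 mixing step.
pub fn mix64(bytes: &[u8]) -> u64 {
    splitmix64(fnv1a64(bytes))
}

/// Split a 64-bit hash into (h1, h2) = (low 32 bits, high 32 bits),
/// both widened to u64. (Hypothetical helper name.)
pub fn halves(h: u64) -> (u64, u64) {
    (h & 0xffff_ffff, h >> 32)
}
```

The `wrapping_*` calls matter in Rust: debug builds panic on plain `+` and `*` overflow, whereas Go and C++ unsigned arithmetic wraps silently.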

3. Synthesizing k indices (Kirsch–Mitzenmacher)

For a bit array of size m and k hashes:

for i in 0..k:
    g = h1 + i * h2          // wrapping 64-bit add
    idx = g mod m            // reduce to [0, m)
    use bits[idx]

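The reduction g mod m is hot. Naive % works, but a 64-bit modulo costs on the order of 20 cycles. RocksDB and others use Lemire's multiply-shift range reduction instead:

fastrange(g, m) = ((g as u128) * (m as u128)) >> 64

(Equivalent to floor(g * m / 2^64), a near-uniform map of [0, 2^64) to [0, m).) The two reductions map the same g to different indices, and the multiply-shift form only works when g is spread over the full 64-bit range; with 32-bit h1 and h2 and k ≤ 30, g stays below 2^37 and would collapse onto the lowest indices. Whichever you pick, use it in all three languages, otherwise the bit positions diverge and cross-language tests break.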

This lab uses plain % because it's identical across languages with no language-specific u128 syntax to worry about. Performance difference is irrelevant at filter-construction scale.
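As a concrete illustration of the index synthesis with plain `%`, here is a small sketch. The function name `indices` is hypothetical; it takes the already-mixed 64-bit hash so the example stays independent of the hash chain.

```rust
/// Derive k bit indices from one 64-bit hash:
/// h1 = low 32 bits, h2 = high 32 bits, g_i = h1 + i*h2, idx_i = g_i % m.
/// (Hypothetical helper name; `m` must be non-zero.)
pub fn indices(h: u64, k: u32, m: u64) -> Vec<u64> {
    assert!(m > 0, "m must be non-zero");
    let h1 = h & 0xffff_ffff;
    let h2 = h >> 32;
    (0..k as u64)
        .map(|i| h1.wrapping_add(i.wrapping_mul(h2)) % m)
        .collect()
}
```

With h1 = 5, h2 = 3, k = 4, and m = 7, the g values are 5, 8, 11, 14 and the resulting indices are 5, 1, 4, 0.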

Test gate

Before moving on, all three bloombench hash foobar invocations must print:

fnv1a64=85944171f73967e8
splitmix=...        (same value across languages)
h1=...  h2=...      (same values across languages)

If those don't match, the rest of the lab cannot succeed.

Step 02 — Bit array, add, contains

The bit array

Backed by a Vec<u8> / []byte / std::vector<uint8_t> of length ⌈m / 8⌉. Indexing is little-endian within each byte:

bit i  →  byte index  = i / 8
         bit within  = i % 8         // bit 0 is the LSB
         test:  (bytes[i/8] >> (i%8)) & 1
         set:    bytes[i/8] |= 1 << (i%8)

Side-by-side:

Rust:

// LSB-first: bit i lives in byte i/8 at position i%8.
fn set_bit(bits: &mut [u8], i: u64) {
    let idx = (i / 8) as usize;
    let off = (i % 8) as u8;
    bits[idx] |= 1u8 << off;
}
fn get_bit(bits: &[u8], i: u64) -> bool {
    let idx = (i / 8) as usize;
    let off = (i % 8) as u8;
    (bits[idx] >> off) & 1 == 1
}

Go:

func setBit(bits []byte, i uint64) {
    bits[i/8] |= 1 << (i % 8)
}
func getBit(bits []byte, i uint64) bool {
    return bits[i/8]&(1<<(i%8)) != 0
}

C++:

inline void SetBit(std::uint8_t* b, std::uint64_t i) {
    b[i / 8] |= std::uint8_t{1} << (i % 8);
}
inline bool GetBit(const std::uint8_t* b, std::uint64_t i) {
    return (b[i / 8] >> (i % 8)) & 1u;
}

⚠️ Pick one bit-order now. LSB-first (above) matches LevelDB and is the natural choice in C. MSB-first matches some networking specs (TCP option encoding). Whichever you pick, all three implementations must agree.

add(key)

add(key):
    h = mix64(key)
    h1 = h & 0xffffffff
    h2 = h >> 32
    for i in 0..k:
        idx = (h1 + i * h2) mod m
        set_bit(idx)

Notes:

  • All arithmetic is wrapping 64-bit (u64/uint64/std::uint64_t).
  • With 32-bit h1 and h2 and k ≤ 30, h1 + i * h2 stays below 2^37 and cannot overflow a u64. Use wrapping_mul/wrapping_add anyway in languages with overflow checks (Rust debug mode), so behaviour stays defined and identical if the halves are ever widened.
  • We compute h once per key, then derive k indices. That's the entire Kirsch–Mitzenmacher win.

contains(key)

contains(key):
    h = mix64(key)
    h1 = h & 0xffffffff
    h2 = h >> 32
    for i in 0..k:
        idx = (h1 + i * h2) mod m
        if not get_bit(idx):
            return false
    return true

Early-exit on the first zero bit. With FPR 1% and a random absent key, you typically exit after 1 or 2 probes.
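To make the early exit concrete, here is a self-contained sketch of add and contains on a tiny filter. It is illustrative only: the struct layout and the private `index` helper are assumptions, and the hash chain is inlined so the block compiles on its own.

```rust
pub struct Bloom {
    k: u32,
    m: u64,
    bits: Vec<u8>,
}

/// FNV-1a64 followed by one SplitMix64 step (inlined for self-containment).
fn mix64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h = (h ^ b as u64).wrapping_mul(0x100000001b3);
    }
    let mut x = h.wrapping_add(0x9e3779b97f4a7c15);
    x = (x ^ (x >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94d049bb133111eb);
    x ^ (x >> 31)
}

impl Bloom {
    pub fn new(m: u64, k: u32) -> Self {
        assert!(m > 0 && k > 0);
        Bloom { k, m, bits: vec![0u8; ((m + 7) / 8) as usize] }
    }

    /// i-th Kirsch–Mitzenmacher index: (h1 + i*h2) % m. (Hypothetical helper.)
    fn index(&self, h: u64, i: u64) -> u64 {
        (h & 0xffff_ffff).wrapping_add(i.wrapping_mul(h >> 32)) % self.m
    }

    pub fn add(&mut self, key: &[u8]) {
        let h = mix64(key);
        for i in 0..self.k as u64 {
            let idx = self.index(h, i);
            self.bits[(idx / 8) as usize] |= 1u8 << (idx % 8);
        }
    }

    /// `all` short-circuits, so the loop stops at the first zero bit.
    pub fn contains(&self, key: &[u8]) -> bool {
        let h = mix64(key);
        (0..self.k as u64).all(|i| {
            let idx = self.index(h, i);
            (self.bits[(idx / 8) as usize] >> (idx % 8)) & 1 == 1
        })
    }
}
```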

Three-language add skeleton

Rust:

pub fn add(&mut self, key: &[u8]) {
    let h = mix64(key);
    let h1 = h as u32 as u64; // low 32 bits
    let h2 = h >> 32;         // high 32 bits
    for i in 0..self.k as u64 {
        let idx = h1.wrapping_add(i.wrapping_mul(h2)) % self.m;
        set_bit(&mut self.bits, idx);
    }
}

Go:

func (b *Bloom) Add(key []byte) {
    h := Mix64(key)
    h1 := h & 0xffffffff
    h2 := h >> 32
    for i := uint64(0); i < uint64(b.k); i++ {
        idx := (h1 + i*h2) % b.m
        setBit(b.bits, idx)
    }
}

C++:

void Bloom::Add(const std::uint8_t* k, std::size_t n) {
    std::uint64_t h = Mix64(k, n);
    std::uint64_t h1 = h & 0xffffffffULL;
    std::uint64_t h2 = h >> 32;
    for (std::uint64_t i = 0; i < k_; ++i) {
        std::uint64_t idx = (h1 + i * h2) % m_;
        SetBit(bits_.data(), idx);
    }
}

Test gate

  • Insert keys "k0".."k999" (UTF-8). contains("kN") must be true for every N. Any false negative is a critical bug.
  • Filter must be byte-identical across the three languages (md5sum).

What broken looks like

  • contains is sometimes false for present keys → set and get disagree on bit-within-byte. Suspect LSB vs MSB.
  • Cross-language filters differ → mod m reduction differs (one uses & instead of % and m isn't a power of two), or h1/h2 halves are swapped.
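  • contains is always true → the bit array is saturated (m far too small for n, for example m rounded down to a handful of bits). m == 0 cannot even get this far: h % 0 panics in Rust and Go and is undefined behaviour in C++, so constructors should reject it.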

Step 03 — Sizing, encode/decode, FPR measurement

Picking m and k from (n, fpr)

Given target false-positive rate p and expected key count n:

$$m = \left\lceil \frac{-n \cdot \ln p}{(\ln 2)^2} \right\rceil$$

$$k = \max\left(1,\; \operatorname{round}\left(\frac{m}{n} \cdot \ln 2\right)\right)$$

Reference numbers (compute once, hard-code in tests):

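| n | p | m (bits) | bits/key | k |
|---|---|---|---|---|
| 1 000 | 0.10 | ~4 793 | 4.79 | 3 |
| 1 000 | 0.01 | ~9 586 | 9.59 | 7 |
| 1 000 | 0.001 | ~14 378 | 14.38 | 10 |
| 10 000 | 0.01 | ~95 851 | 9.59 | 7 |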

Implementation:

with_fpr(n, p):
    ln2     = ln(2)
    m_real  = -(n as f64) * ln(p) / (ln2 * ln2)
    m       = ceil(m_real)
    k_real  = (m as f64 / n as f64) * ln2
    k       = round(k_real) clamped to [1, 30]
    return BloomFilter::new(m, k)

The clamp on k prevents pathological cases. with_fpr(1, 1e-100) would otherwise request k ≈ 333, i.e. hundreds of probes per add and contains for no practical gain.
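The sizing pseudocode can be sketched in Rust as follows. To keep the example self-contained it returns the pair (m, k) instead of constructing a filter; the function name `size_for` is hypothetical.

```rust
/// Pick (m, k) for `n` expected keys and target false-positive rate `p`:
/// m = ceil(-n ln p / (ln 2)^2), k = round((m/n) ln 2) clamped to [1, 30].
/// (Hypothetical helper; requires n > 0 and 0 < p < 1.)
pub fn size_for(n: u64, p: f64) -> (u64, u32) {
    assert!(n > 0 && p > 0.0 && p < 1.0);
    let ln2 = std::f64::consts::LN_2;
    let m = (-(n as f64) * p.ln() / (ln2 * ln2)).ceil().max(1.0) as u64;
    // round() (not truncation) matters: truncating 0.9 would give k = 0.
    let k = ((m as f64 / n as f64) * ln2).round().clamp(1.0, 30.0) as u32;
    (m, k)
}
```

`size_for(1000, 0.01)` yields (9586, 7), matching the reference table, and `size_for(1, 1e-100)` hits the clamp at k = 30.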

Encode

[ k: u32 LE ][ m: u64 LE ][ bits: ⌈m/8⌉ bytes ]

Rust:

pub fn encode(&self) -> Vec<u8> {
    let mut out = Vec::with_capacity(12 + self.bits.len());
    out.extend_from_slice(&self.k.to_le_bytes()); // 4-byte k, little-endian
    out.extend_from_slice(&self.m.to_le_bytes()); // 8-byte m, little-endian
    out.extend_from_slice(&self.bits);            // ⌈m/8⌉ bytes of bits
    out
}

Go:

func (b *Bloom) Encode() []byte {
    out := make([]byte, 12+len(b.bits))
    binary.LittleEndian.PutUint32(out[0:4], b.k)
    binary.LittleEndian.PutUint64(out[4:12], b.m)
    copy(out[12:], b.bits)
    return out
}

C++:

std::vector<std::uint8_t> Bloom::Encode() const {
    std::vector<std::uint8_t> out(12 + bits_.size());
    EncodeU32LE(out.data() + 0, k_);
    EncodeU64LE(out.data() + 4, m_);
    std::memcpy(out.data() + 12, bits_.data(), bits_.size());
    return out;
}

Decode

decode(bytes):
    if len(bytes) < 12: error
    k = read u32 LE @ 0
    m = read u64 LE @ 4
    body = bytes[12:]
    if len(body) != ⌈m/8⌉: error
    return BloomFilter { k, m, bits: body }

Validate sizes. If k == 0 or m == 0, reject — those are nonsense.
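A Rust rendering of the decode pseudocode, including the validation just described, might look like the sketch below. The struct fields and the `String` error type are assumptions made for the example, not the lab's actual types.

```rust
#[derive(Debug, PartialEq)]
pub struct Bloom {
    pub k: u32,
    pub m: u64,
    pub bits: Vec<u8>,
}

/// Parse `k (u32 LE) || m (u64 LE) || bits` and validate every size.
pub fn decode(bytes: &[u8]) -> Result<Bloom, String> {
    if bytes.len() < 12 {
        return Err("input shorter than the 12-byte header".to_string());
    }
    let mut kb = [0u8; 4];
    kb.copy_from_slice(&bytes[0..4]);
    let mut mb = [0u8; 8];
    mb.copy_from_slice(&bytes[4..12]);
    let k = u32::from_le_bytes(kb);
    let m = u64::from_le_bytes(mb);
    if k == 0 || m == 0 {
        return Err("k and m must be non-zero".to_string());
    }
    let body = &bytes[12..];
    // ceil(m/8) written without (m + 7), which could overflow near u64::MAX.
    let want = m / 8 + u64::from(m % 8 != 0);
    if body.len() as u64 != want {
        return Err(format!("expected {} payload bytes, got {}", want, body.len()));
    }
    Ok(Bloom { k, m, bits: body.to_vec() })
}
```

Rejecting a payload that is too long as well as too short keeps decode a strict inverse of encode, which is what the byte-identical cross-language check relies on.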

Measuring FPR

fpr-test(filter, n, m_queries):
    seed reader rng to disjoint stream
    hits = 0
    for q in 0..m_queries:
        key = generate distinctly-not-inserted key
        if filter.contains(key):
            hits += 1
    observed = hits / m_queries
    theoretical = (1 - exp(-k * n / m))^k
    print observed, theoretical

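Generating known-absent keys: either reuse the inserted family with indices ≥ n ("key{n}", "key{n+1}", ...), or switch to a different prefix. If insert used "key0", "key1", ..., "key{n-1}", querying with "q0", "q1", ... guarantees no accidental overlap.

Sampling noise shrinks with the square root of the hit count. At 1% FPR, 100 000 absent queries give about 1 000 expected hits and roughly ±3% relative noise (one standard deviation); that's the sample size used in the test fpr_within_2x. A million queries tightens this to about ±1%.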
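The noise level of an FPR measurement follows from treating each absent-key query as an independent Bernoulli trial. The helper below is a small sketch of that arithmetic; the name `relative_std_error` is hypothetical.

```rust
/// Relative standard error of an FPR estimate obtained from `queries`
/// independent absent-key lookups when the true rate is `p`:
/// sqrt(p (1 - p) / queries) / p = sqrt((1 - p) / (p * queries)).
pub fn relative_std_error(p: f64, queries: u64) -> f64 {
    ((1.0 - p) / (p * queries as f64)).sqrt()
}
```

For p = 0.01 this gives roughly 3% at 100 000 queries and roughly 1% at 1 000 000 queries, which is why a 2× tolerance leaves ample headroom.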

Test gate

  • with_fpr(1000, 0.01) returns k=7 and m ∈ [9 581, 9 591].
  • encode then decode gives an identical filter.
  • fpr-test with n=10 000, m_queries=100 000 reports observed FPR within 2× of theoretical (sampling noise at this size is only a few percent, so a healthy filter passes with a wide margin).
  • The encoded filter is byte-identical across Rust / Go / C++.

What broken looks like

  • k=0 from with_fpr → integer truncation; you used int(k_real) instead of round.
  • Decode fails on a perfectly valid file → endian mismatch or header offset wrong.
  • Observed FPR is exactly 1.0 → bit array got written but indices land outside its range (modulo bug).
  • Observed FPR is exactly 0.0 → contains always returns false; bit array isn't being touched on add (you forgot to mutate self).

LSM MemTable

Lab: db-05 — the in-memory write buffer of an LSM-tree.

1. What Is It

A MemTable is the in-memory, sorted write buffer at the top of every Log-Structured Merge tree (LSM). All writes — put, delete, range updates — land in the MemTable first, indexed by key, and only later get flushed to immutable on-disk SSTables (see db-06). It is paired with a Write-Ahead Log (db-03) for durability: WAL gives crash recovery; the MemTable gives fast point and range lookups.

This lab implements a deterministic, byte-identical MemTable across Rust, Go, and C++ that can be serialized to disk and read back in any of the three languages.

2. Why It Matters

  • Write throughput. Writes touch only RAM (plus a single sequential WAL append). Random puts become sequential disk traffic.
  • Read recency. The MemTable is the freshest copy of any key; a get must consult it first before falling through to L0..Ln SSTables.
  • Flush boundary. Once the MemTable hits its size cap (write_buffer_size in LevelDB/RocksDB), it freezes, a new MemTable rotates in, and the frozen one is written sequentially to an SSTable on background threads.
  • Tombstones. Deletes are inserts of tombstone records; the MemTable must preserve them through flush so older SSTables can be shadowed.

3. How It Works

                writes                      reads
                  │                           │
                  ▼                           ▼
   ┌──────── MemTable (active) ─────────┐  point/range
   │   sorted map: key → (type, value)  │◄──────────┐
   │   tombstones live alongside values │           │
   └──────────────────┬─────────────────┘           │
        size > cap?   │                              │
                      ▼                              │
       ┌── Immutable MemTable (frozen) ─┐            │
       │   flushing in the background    │◄───────────┤
       └──────────────────┬──────────────┘           │
                          ▼                          │
                   SSTable on disk ─────────────────►┘
                   (db-06 format)

Internally the MemTable is a sorted associative container with byte-lexicographic key ordering:

  • Rust: BTreeMap<Vec<u8>, Entry> (Vec<u8>'s Ord is lex over bytes).
  • Go: map[string]Entry + key slice sorted on dump/iteration.
  • C++: std::map<std::vector<uint8_t>, Entry> (operator< on vectors is lex).

Production LSMs (LevelDB, RocksDB) use a skip list because it offers concurrent lock-free reads and allocations from an arena. For this lab the simpler tree is fine — what matters is the order-determinism and the on-disk byte layout.
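A minimal sketch of the tree-backed variant in Rust is shown below. The `Entry` variants and method names are assumptions for illustration; the lab's actual types, sequence numbers, and size accounting are omitted.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
pub enum Entry {
    Put(Vec<u8>),
    Delete, // tombstone: the key was deleted and must shadow older data
}

#[derive(Default)]
pub struct MemTable {
    // Vec<u8> orders byte-lexicographically, which gives the sorted iteration
    // that a sequential flush needs.
    map: BTreeMap<Vec<u8>, Entry>,
}

impl MemTable {
    pub fn put(&mut self, key: &[u8], value: &[u8]) {
        self.map.insert(key.to_vec(), Entry::Put(value.to_vec()));
    }

    /// A delete is an insert of a tombstone, not a removal from the map.
    pub fn delete(&mut self, key: &[u8]) {
        self.map.insert(key.to_vec(), Entry::Delete);
    }

    /// None means "never seen here, fall through to older data";
    /// Some(&Entry::Delete) means "deleted, stop searching".
    pub fn get(&self, key: &[u8]) -> Option<&Entry> {
        self.map.get(key)
    }

    /// Entries in ascending key order, tombstones included.
    pub fn iter(&self) -> impl Iterator<Item = (&Vec<u8>, &Entry)> + '_ {
        self.map.iter()
    }
}
```

Keeping the tombstone in the map (instead of removing the key) is what lets a later flush carry the delete down to the SSTable level.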

4. Core Terminology

| Term | Definition |
|---|---|
| MemTable | Sorted in-memory map of keys to values/tombstones; the LSM write buffer. |
| Immutable MemTable | A frozen MemTable, no longer accepting writes, awaiting flush. |
| Tombstone | A delete marker stored as an entry of type Delete; needed because older SSTables may still hold the key. |
| Skip list | Randomized layered linked-list giving expected O(log n) insert/lookup; LevelDB/RocksDB's choice. |
| Flush | Writing a frozen MemTable out as an SSTable. |
| Sequence number | Monotonically increasing version tag attached to each write so readers can pick the right snapshot. |
| Arena | Bump allocator that backs MemTable nodes; freed in one go when the table is dropped. |

5. Mental Models

  • Three-layer journal. WAL = durability log. MemTable = sorted index over the WAL's recent tail. SSTable = compacted, immutable snapshot. The MemTable is the short-term, queryable face of the WAL.
  • Latest write wins. For a single point lookup the MemTable always shadows any on-disk data; a tombstone in the MemTable hides every prior value of that key.
  • Flush is amplification's first knob. Larger MemTables → fewer, bigger L0 SSTables → less compaction work but more recovery time and RAM. Production tunes this between 16 MiB and 256 MiB.
  • Why sorted? Because the flush writes the SSTable in a single sequential pass — no on-disk sort needed if the MemTable is already ordered.

6. Common Misconceptions

  • "The MemTable is the WAL." No. The WAL is unsorted, append-only, and may contain redundant updates for the same key. The MemTable is sorted and deduplicated.
  • "Tombstones can be GC'd in the MemTable." No — they must be flushed; only after compaction confirms no older SSTable holds the key can a tombstone be dropped.
  • "You can skip the WAL if writes are batched." The MemTable lives in RAM. Without the WAL a crash loses every unflushed write.
  • "Skip list is the only valid structure." A B-tree, ART, or sorted vector with occasional rebuild are all viable; skip list wins for the specific concurrency pattern of one writer + many readers.

7. Interview Talking Points

  • Explain why an LSM uses a MemTable + WAL instead of writing directly to a sorted on-disk file (random I/O kills throughput).
  • Walk through the lifecycle: put → WAL append → MemTable insert → eventually frozen → flushed → compacted.
  • Describe how a get traverses MemTable → immutable MemTable → L0 SSTables → Ln, stopping at the first match (value or tombstone).
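  • Cost of tombstones: a lookup for a key that is absent from the newer levels cannot stop early; it must keep descending until it finds a value or a tombstone. Tombstones themselves stay around until compaction can confirm no older SSTable holds the key, so they add to space use and to the entries a scan must step over.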
  • Why a MemTable's flush is a sorted sequential write — and why this is the primary trick that makes LSMs faster than B-trees for write-heavy workloads.
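The lookup order described in the talking points above can be sketched in a few lines. This is a minimal illustration, assuming every layer (MemTable, immutable MemTable, each SSTable) is modelled as an in-memory BTreeMap; the names `Layer` and `lookup` are hypothetical and not part of the lab's API.

```rust
use std::collections::BTreeMap;

/// An entry is either a live value or a delete marker.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Entry {
    Value(Vec<u8>),
    Tombstone,
}

/// Hypothetical stand-in for one layer: MemTable, immutable MemTable, or SSTable.
pub type Layer = BTreeMap<Vec<u8>, Entry>;

/// Search layers from newest to oldest and stop at the first match.
/// A tombstone counts as a match: it hides every older value of the key.
pub fn lookup(layers_newest_first: &[Layer], key: &[u8]) -> Option<Vec<u8>> {
    for layer in layers_newest_first {
        match layer.get(key) {
            Some(Entry::Value(v)) => return Some(v.clone()),
            Some(Entry::Tombstone) => return None, // deleted; do not look further
            None => continue,                       // not in this layer; go older
        }
    }
    None // absent from every layer
}
```

Note that an absent key forces the loop through every layer, which is the read-amplification cost that Bloom filters on SSTables are meant to reduce.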

8. Connections to Other Labs

  • db-03 (WAL): every MemTable write is preceded by a WAL append; the WAL is replayed into a fresh MemTable on startup.
  • db-04 (Bloom filters): SSTables produced by MemTable flush carry Bloom filters for negative lookups.
  • db-06 (SSTable format): the flush target; this lab's flush_to is the producer side of db-06's open.
  • db-07 (compaction): consumes SSTables that came from MemTable flushes.
  • db-09 (LevelDB complete): stitches all of the above into a working KV store.

References — db-05 LSM MemTable

Primary sources

Skip lists

Alternative data structures

  • Bw-tree (Microsoft, 2013): lock-free B+ tree variant used in Hekaton.
  • Adaptive Radix Tree (ART, 2013): compact, cache-friendly trie used by HyPer and DuckDB. https://db.in.tum.de/~leis/papers/ART.pdf
  • Masstree (Mao, Kohler, Morris, 2012): trie-of-B+trees, very fast for variable length keys.

Tombstones and snapshot reads

  • RocksDB DeleteRange. Tombstones over key ranges, important for prefix deletes. https://github.com/facebook/rocksdb/wiki/DeleteRange

  • LevelDB sequence numbers. Each MemTable entry is internally tagged with a 56-bit sequence and 8-bit type byte; this lab simplifies to just the type byte. See db/dbformat.h kValueTypeForSeek.

Real-world tunings

  • Cassandra: uses Memtable with off-heap allocators; flushed to SSTables on size, time, or commit-log pressure.
  • HBase: MemStore per column family; configurable via hbase.hregion.memstore.flush.size.
  • InfluxDB IOx & TimescaleDB: apply LSM ideas to time-series, with time-bucketed MemTables.

Further reading

Analysis — db-05 LSM MemTable

Problem

Implement the in-memory write buffer of an LSM-tree such that

  1. it supports put, delete (tombstone insertion), get, and ordered iteration;
  2. it can be serialized to disk in a deterministic byte layout shared by Rust, Go, and C++;
  3. the same dump can be reloaded in any of the three languages;
  4. iteration order is byte-lexicographic on keys.

Constraints

  • Keys are arbitrary byte sequences up to 2^32 − 1 bytes (u32 length prefix).
  • Values are arbitrary byte sequences up to 2^32 − 1 bytes; for tombstones the value length is 0.
  • Cross-language interop: the dump format must be identical byte-for-byte and the cross-test script asserts SHA-256 equality of the three dumps.
  • No allocator tricks: simplicity over LevelDB-style arena/skiplist; we use the standard sorted map in each language.
  • No concurrency in this lab: single-threaded API. Concurrency arrives in db-09.

Design decisions

Why a sorted associative container, not a skip list?

Production LSMs (LevelDB, RocksDB) use skip lists because they support concurrent lock-free reads and arena allocation. For this teaching lab those benefits are irrelevant — what matters is determinism, byte-identical serialization, and the fact that iteration is in key order so the flush is a sequential write. Any sorted container satisfies that:

| Language | Container | Why |
|---|---|---|
| Rust | BTreeMap<Vec<u8>, Entry> | Vec<u8>: Ord is byte-lex; balanced. |
| Go | map[string]Entry + sort.Strings(keys) | Avoid third-party sorted maps. |
| C++ | std::map<std::vector<uint8_t>, Entry> | RB-tree; vector<uint8_t>::operator< is lex. |

All three give the same iteration order for identical input, which is what cross-test checks.

Tombstones as entries

A delete(k) replaces whatever entry k had with Entry::Tombstone. Crucially the key is not erased — the tombstone must propagate to the SSTable to shadow older on-disk versions of the key.

On-disk dump layout

   offset  size  field
   ------  ----  --------------------------------
        0     4  magic ASCII "MMT1"
        4     4  count: u32 LE (entry count)
        8     ?  entries, sorted by key ascending:
                   [ klen: u32 LE ]
                   [ vlen: u32 LE ]   (0 if tombstone)
                   [ type: u8     ]   (0 = Value, 1 = Tombstone)
                   [ key bytes    ]
                   [ value bytes  ]

All multi-byte integers little-endian; the file is self-delimiting via count and each entry's two length prefixes. No checksum at this layer — the WAL (db-03) and the SSTable (db-06) carry their own.

Size accounting

size_bytes() returns the on-disk dump size assuming the current contents flush immediately: 8 bytes header + per entry (9 + klen + vlen). This is what an LSM would compare against write_buffer_size.
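The accounting rule can be checked with a small stand-alone function. This is a sketch under the layout described above; `dump_size` and the `(key, Option<value>)` representation are hypothetical names chosen for the example, with `None` standing in for a tombstone.

```rust
/// Size of the dump header: 4-byte magic plus 4-byte entry count.
const HEADER_BYTES: usize = 8;
/// Fixed per-entry overhead: klen (4) + vlen (4) + type (1).
const ENTRY_OVERHEAD: usize = 9;

/// Dump size for a list of (key, optional value) pairs.
/// `None` models a tombstone, which contributes no value bytes.
pub fn dump_size(entries: &[(&[u8], Option<&[u8]>)]) -> usize {
    let body: usize = entries
        .iter()
        .map(|(k, v)| ENTRY_OVERHEAD + k.len() + v.map_or(0, |v| v.len()))
        .sum();
    HEADER_BYTES + body
}
```

For key "alpha" with value "first" plus a tombstone for "beta", this gives 8 + (9+5+5) + (9+4+0) = 40 bytes; an empty table is just the 8-byte header.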

Error model

The decoder validates:

  • magic == MMT1,
  • enough bytes remain for each header field and the declared key/value spans,
  • type is 0 or 1,
  • tombstones have vlen == 0,
  • no trailing garbage,
  • keys appear in strictly ascending order.

A failure returns an explicit error (Error::* in Rust, error in Go, std::invalid_argument / std::runtime_error in C++) rather than panicking. The encoder cannot fail (no I/O at that layer).
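One possible shape for the Rust error type is sketched below. The variant names mirror the checks in the list above, but the enum itself, its messages, and the `check_header` helper are assumptions made for illustration; the lab's actual `Error` may differ.

```rust
use std::fmt;

/// Hypothetical decode error type; one variant per validation rule.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Error {
    Short,        // input ends before a header field or a key/value span
    BadMagic,     // first four bytes are not "MMT1"
    BadType,      // type byte is neither 0 nor 1
    BadTombstone, // tombstone with vlen != 0
    Unsorted,     // keys not strictly ascending
    Trailing,     // bytes left over after the last entry
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let msg = match self {
            Error::Short => "truncated input",
            Error::BadMagic => "bad magic",
            Error::BadType => "unknown entry type",
            Error::BadTombstone => "tombstone with non-zero vlen",
            Error::Unsorted => "keys not strictly ascending",
            Error::Trailing => "trailing bytes",
        };
        f.write_str(msg)
    }
}

impl std::error::Error for Error {}

/// Validate only the 8-byte header and return the entry count.
/// A stand-in for the first step of a full decoder.
pub fn check_header(bytes: &[u8]) -> Result<u32, Error> {
    if bytes.len() < 8 {
        return Err(Error::Short);
    }
    if &bytes[..4] != b"MMT1" {
        return Err(Error::BadMagic);
    }
    Ok(u32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]))
}
```

Returning a value instead of panicking lets the CLI print a stable message and lets tests compare errors with `assert_eq!`.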

Trade-offs

  • No sequence numbers. LevelDB-style LSMs append a 64-bit (seqno << 8) | type trailer to every internal key so MVCC snapshots can pick the right version. We collapse to "latest write wins" because db-13 reintroduces MVCC.
  • No range tombstones. Each delete shadows exactly one key. RocksDB-style range deletes are db-09 work.
  • No prefix bloom or compressed entries. The MemTable is in RAM; flushing to a proper SSTable (db-06) is where compression and block boundaries appear.
  • Allocation policy: Vec/vector/[]byte-per-entry, not an arena. Allocator pressure becomes interesting only at multi-million-key scales, which we exercise in db-22 benchmarking.

Alternatives considered

  • Skip list with arena (LevelDB style). Better concurrency, cache locality, and drop-the-whole-arena freeing — but the data structure complexity (random levels, acquire/release pointer ops) would dwarf the lab's pedagogical point.
  • Sorted vector with binary search. Lowest memory overhead, but every put is O(n) due to mid-vector insertion. Fine for tiny tables (<1 K entries), terrible beyond that.
  • HashMap with periodic sort. Fast inserts, but iteration is no longer cheap; every flush triggers a sort. Acceptable if flush is rare, painful otherwise.
  • B-epsilon tree. Batches writes inside internal nodes, blurring the line with LSM. Out of scope.

Execution — db-05 LSM MemTable

Library API (Rust shape; mirrored in Go and C++)

pub enum Entry { Value(Vec<u8>), Tombstone }

pub struct MemTable { /* sorted map */ }

impl MemTable {
    pub fn new() -> Self;
    pub fn len(&self) -> usize;
    pub fn size_bytes(&self) -> usize;
    pub fn put(&mut self, key: &[u8], value: &[u8]);
    pub fn delete(&mut self, key: &[u8]);
    pub fn get(&self, key: &[u8]) -> Option<&Entry>;
    pub fn iter(&self) -> impl Iterator<Item = (&[u8], &Entry)>;
    pub fn encode(&self) -> Vec<u8>;
    pub fn decode(bytes: &[u8]) -> Result<Self, Error>;
}

Go: func New() *MemTable, func (*MemTable) Put / Delete / Get / Iter / Encode, func Decode([]byte) (*MemTable, error). Iter yields a slice of (key, entry) pairs in sorted order.

C++: class MemTable with the same names; Iter() returns a const reference to the underlying std::map.

CLI: memtable

memtable <subcommand> [args]

Subcommands:
  new PATH                       create an empty MemTable at PATH
  put PATH KEY VALUE             open PATH, put, save
  del PATH KEY                   open PATH, delete (writes tombstone), save
  get PATH KEY                   print 'value: <hex>' | 'tombstone' | 'absent'
  iter PATH                      print one line per entry: TYPE KEY VALUE  (hex)
  bulk PATH N                    open or create PATH, insert key0..key{N-1}
                                 with values val0..val{N-1}, save
  size PATH                      print 'entries=N size_bytes=B'

Iter output format (deterministic, used by cross-test):

V <hex-key> <hex-value>
T <hex-key>

Hex is lowercase, no 0x prefix, no separators.
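The line format is simple enough to pin down in code. The following is a minimal sketch, assuming a tombstone is modelled as `None`; `to_hex` and `iter_line` are hypothetical helper names, not the lab's actual functions.

```rust
/// Lowercase hex with no 0x prefix and no separators.
pub fn to_hex(bytes: &[u8]) -> String {
    let mut s = String::with_capacity(bytes.len() * 2);
    for b in bytes {
        s.push_str(&format!("{:02x}", b));
    }
    s
}

/// Format one iter line: `V <hex-key> <hex-value>` or `T <hex-key>`.
/// `None` models a tombstone.
pub fn iter_line(key: &[u8], value: Option<&[u8]>) -> String {
    match value {
        Some(v) => format!("V {} {}", to_hex(key), to_hex(v)),
        None => format!("T {}", to_hex(key)),
    }
}
```

The `{:02x}` format keeps the leading zero of small bytes, so every byte always contributes exactly two characters and the output stays unambiguous.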

Build & test

Per language:

# Rust
cd src/rust && cargo test --release && cargo build --release

# Go
cd src/go && go test ./... && go build -o bin/memtable ./cmd/memtable

# C++
cd src/cpp && cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \
            && cmake --build build && ( cd build && ctest --output-on-failure )

Or run all at once:

bash scripts/verify.sh

Cross-language interop test

scripts/cross_test.sh:

  1. Build all three binaries.
  2. Drive each one through the same sequence of bulk 100, a handful of puts with overwrites, and a handful of dels.
  3. SHA-256 each dump; assert all three match.
  4. For each writer/reader pair, run iter and check the output is byte-identical.

If any pair differs, the script prints the failing combination and exits non-zero.

Manual playground

$ memtable new /tmp/m.bin
$ memtable put /tmp/m.bin alpha "first"
$ memtable put /tmp/m.bin beta  "second"
$ memtable put /tmp/m.bin alpha "first-updated"   # overwrite
$ memtable del /tmp/m.bin beta                    # tombstone
$ memtable iter /tmp/m.bin
V 616c706861 66697273742d75706461746564
T 62657461
$ memtable get /tmp/m.bin alpha
value: 66697273742d75706461746564
$ memtable get /tmp/m.bin beta
tombstone
$ memtable get /tmp/m.bin gamma
absent
$ memtable size /tmp/m.bin
entries=2 size_bytes=48

48 = 8 (header) + (9+5+13) + (9+4+0) = 8 + 27 + 13: two entries, key "alpha" with value "first-updated" and a tombstone for key "beta".

What broken looks like

| Symptom | Likely cause |
|---|---|
| Cross-test SHA mismatch on first byte set | Magic disagreement (must be ASCII MMT1). |
| Cross-test SHA mismatch mid-file | Endianness or type byte placement differs. |
| iter order differs across langs | Go's map iteration order; missed the sort.Strings. |
| get returns absent after del | Tombstone not stored, only erased. |
| Decoder accepts trailing garbage | Forgot the "consumed all bytes" check. |

Observation — db-05 LSM MemTable

Hex layout of a tiny dump

Three operations: put alpha first, put beta second, del beta.

hexdump -C m.bin
00000000  4d 4d 54 31 02 00 00 00  05 00 00 00 05 00 00 00  |MMT1............|
00000010  00 61 6c 70 68 61 66 69  72 73 74 04 00 00 00 00  |.alphafirst.....|
00000020  00 00 00 01 62 65 74 61                          |....beta|

Annotated:

| Offset | Bytes | Field |
|---|---|---|
| 0 | 4d 4d 54 31 | magic ASCII MMT1 |
| 4 | 02 00 00 00 | count = 2 |
| 8 | 05 00 00 00 | klen = 5 (alpha) |
| 12 | 05 00 00 00 | vlen = 5 (first) |
| 16 | 00 | type = Value |
| 17 | 61 6c 70 68 61 | key bytes alpha |
| 22 | 66 69 72 73 74 | value bytes first |
| 27 | 04 00 00 00 | klen = 4 (beta) |
| 31 | 00 00 00 00 | vlen = 0 (tombstone) |
| 35 | 01 | type = Tombstone |
| 36 | 62 65 74 61 | key bytes beta |

Total: 40 bytes; matches size_bytes() = 8 + (9+5+5) + (9+4+0) = 40.
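The same 40 bytes can be produced by a few lines of stand-alone code, which is a handy way to check the offsets in the annotation. This is a sketch that assumes the caller passes entries already sorted by key; `push_entry` and `encode_sorted` are hypothetical names, and `None` stands in for a tombstone.

```rust
/// Append one entry in the dump layout: klen, vlen, type, key, value.
/// `None` models a tombstone (type 1, vlen 0).
fn push_entry(out: &mut Vec<u8>, key: &[u8], value: Option<&[u8]>) {
    let ty: u8 = if value.is_some() { 0 } else { 1 };
    let val: &[u8] = value.unwrap_or(&[]);
    out.extend_from_slice(&(key.len() as u32).to_le_bytes());
    out.extend_from_slice(&(val.len() as u32).to_le_bytes());
    out.push(ty);
    out.extend_from_slice(key);
    out.extend_from_slice(val);
}

/// Build the dump for entries that are already sorted by key.
pub fn encode_sorted(entries: &[(&[u8], Option<&[u8]>)]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(b"MMT1");
    out.extend_from_slice(&(entries.len() as u32).to_le_bytes());
    for (k, v) in entries {
        push_entry(&mut out, k, *v);
    }
    out
}
```

Encoding ("alpha", "first") followed by a tombstone for "beta" yields a 40-byte buffer whose byte at offset 16 is 0 (Value) and whose byte at offset 35 is 1 (Tombstone).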

Cross-language byte equality

scripts/cross_test.sh produces three files rust.bin, go.bin, cpp.bin. With the verify script in this lab:

$ shasum -a 256 *.bin
b67…  rust.bin
b67…  go.bin
b67…  cpp.bin

If any one of the three differs we either have endianness disagreement, key ordering disagreement, or someone wrote the type byte in a different position.

Memory layout intuition

key                          entry
"abc"  ──►  Entry::Value("..."  10 bytes)
"abd"  ──►  Entry::Tombstone
"abz"  ──►  Entry::Value(""      0 bytes)   # empty value is legal and ≠ tombstone
"zz"   ──►  Entry::Value("..."  4096 bytes)

Notes:

  • The key length is not stored alongside the in-memory entry — only at encode time.
  • An empty value ("") is a valid value, distinct from Tombstone. The type byte is what discriminates them.

size_bytes() table

For a MemTable with n entries of average key length k̄ and average value length v̄, with a fraction f being tombstones:

$$ \text{size\_bytes}(n, \bar{k}, \bar{v}, f) = 8 + n \cdot (9 + \bar{k}) + n(1-f) \cdot \bar{v} $$

For a few representative sizes:

| n | k̄ | v̄ | f | size_bytes |
|---|---|---|---|---|
| 10 000 | 16 | 100 | 0 | 1,250,008 |
| 100 000 | 32 | 256 | 0.01 | 29,444,008 |
| 1 000 000 | 64 | 1024 | 0 | 1,097,000,008 |

(Compare with LevelDB's default write_buffer_size of 4 MiB or RocksDB's 64 MiB default: the 10 000-entry buffer in the first row occupies about 1.2 MiB, well under either threshold.)
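The estimate is easy to evaluate for other workloads. The function below is a direct transcription of the formula 8 + n(9 + k̄) + n(1 − f)v̄; the name `estimated_size_bytes` is hypothetical and the result is a floating-point estimate, not an exact byte count for any particular table.

```rust
/// Estimated dump size for `n` entries with mean key length `k`, mean value
/// length `v`, and tombstone fraction `f` (tombstones carry no value bytes).
pub fn estimated_size_bytes(n: u64, k: f64, v: f64, f: f64) -> f64 {
    let n = n as f64;
    8.0 + n * (9.0 + k) + n * (1.0 - f) * v
}
```

For example, 10 000 entries with 16-byte keys, 100-byte values, and no tombstones come to 1,250,008 bytes, and 1 000 000 entries with 64-byte keys and 1024-byte values come to 1,097,000,008 bytes.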

What an empty MemTable looks like

hexdump -C empty.bin
00000000  4d 4d 54 31 00 00 00 00                           |MMT1....|

8 bytes. size_bytes() returns 8. len() returns 0.

Iteration order corner cases

keys = ["", "\x00", "\x00\x00", "a", "ab", "b"]

Sorted byte-lex order:

""         (empty key — sorts first)
"\x00"
"\x00\x00"
"a"
"ab"
"b"

Empty keys are legal in this design (klen = 0). They are useful when the key is something like a single byte tag followed by an optional suffix.
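The ordering above is exactly what Rust's default comparison on `Vec<u8>` produces, which a short check makes concrete. This is an illustrative sketch; `sorted_keys` is a hypothetical helper, not part of the lab's API.

```rust
/// Sort keys in byte-lexicographic order, the order the MemTable iterates in.
pub fn sorted_keys(mut keys: Vec<Vec<u8>>) -> Vec<Vec<u8>> {
    // Vec<u8> compares element by element as unsigned bytes;
    // a strict prefix sorts before any longer key that extends it.
    keys.sort();
    keys
}
```

Because the comparison is on unsigned bytes, a key starting with 0x80 sorts after every ASCII key, which is worth testing in any language where `char` may be signed.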

Verification — db-05 LSM MemTable

Unit tests (per language)

| ID | Test name | What it asserts |
|---|---|---|
| V1 | empty_encode_decode | MemTable::new().encode() → 8 bytes MMT1\x00\x00\x00\x00; decode round-trips to an empty table. |
| V2 | put_then_get | After put("k","v"), get("k") returns Value("v"). |
| V3 | overwrite_replaces | Two puts on the same key keep only the latest value; len() stays at 1. |
| V4 | delete_writes_tombstone | After put("k","v") then del("k"), get("k") returns Tombstone (not None). |
| V5 | iter_byte_lex_order | Insert keys in random order; iteration yields them sorted byte-lex ("" first, \x00 next, etc.). |
| V6 | encode_decode_round_trip | Build a 50-entry table with a mix of values and tombstones; encode → decode → every entry matches and len() is preserved. |
| V7 | size_bytes_matches_encode | For any table, size_bytes() == encode().len(). |
| V8 | decoder_rejects_bad_magic | decode(b"XXX1...") returns Err. |
| V9 | decoder_rejects_truncation | Truncate a valid dump at every byte boundary; decode must fail cleanly (no panic). |
| V10 | decoder_rejects_unsorted_keys | Hand-craft a dump where keys go ["b","a"]; decoder rejects. |
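The truncation test (V9) is the one most likely to expose a missing bounds check, so it is worth seeing in full. The sketch below pairs a deliberately minimal stand-alone decoder with the truncation loop. Both `decode` (returning plain tuples and string errors) and `rejects_every_truncation` are hypothetical and written only for this example; the lab's decoder returns a MemTable and a typed error.

```rust
/// Minimal stand-alone decoder for the dump layout.
/// Returns (key, value) pairs where `None` marks a tombstone.
pub fn decode(bytes: &[u8]) -> Result<Vec<(Vec<u8>, Option<Vec<u8>>)>, &'static str> {
    if bytes.len() < 8 {
        return Err("short");
    }
    if &bytes[..4] != b"MMT1" {
        return Err("bad magic");
    }
    let count = u32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]) as usize;
    let mut p = 8;
    let mut out: Vec<(Vec<u8>, Option<Vec<u8>>)> = Vec::new();
    for _ in 0..count {
        // Fixed 9-byte entry header: klen, vlen, type.
        if bytes.len() - p < 9 {
            return Err("short");
        }
        let klen = u32::from_le_bytes([bytes[p], bytes[p + 1], bytes[p + 2], bytes[p + 3]]) as usize;
        let vlen = u32::from_le_bytes([bytes[p + 4], bytes[p + 5], bytes[p + 6], bytes[p + 7]]) as usize;
        let ty = bytes[p + 8];
        p += 9;
        // Compare against what is left so the check itself cannot overflow.
        let remaining = bytes.len() - p;
        if klen > remaining || vlen > remaining - klen {
            return Err("short");
        }
        let key = bytes[p..p + klen].to_vec();
        p += klen;
        let val = bytes[p..p + vlen].to_vec();
        p += vlen;
        if let Some((prev, _)) = out.last() {
            if key <= *prev {
                return Err("unsorted");
            }
        }
        let entry = match ty {
            0 => Some(val),
            1 if vlen == 0 => None,
            1 => return Err("bad tombstone"),
            _ => return Err("bad type"),
        };
        out.push((key, entry));
    }
    if p != bytes.len() {
        return Err("trailing");
    }
    Ok(out)
}

/// V9-style check: every proper prefix of a valid dump must be rejected.
pub fn rejects_every_truncation(dump: &[u8]) -> bool {
    (0..dump.len()).all(|cut| decode(&dump[..cut]).is_err())
}
```

Every proper prefix of a valid dump is rejected because the count field promises entries that the shortened buffer can no longer hold; the trailing-bytes check covers the opposite case of extra data.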

Cross-language interop (scripts/cross_test.sh)

The same scripted scenario runs in each language:

new   → bulk 100 → put "key50" "REPLACED"
                → del "key10"
                → put "" "empty-key-value"
                → del "key99"
                → save

This produces dumps rust.bin, go.bin, cpp.bin. The script then:

  1. SHA-256s all three dumps. All must match — this is the byte-identical gate.
  2. 3×3 reader matrix. Every reader (rust/go/cpp) runs iter on every writer's dump. The lines must be identical across all 9 combinations.
  3. get spot-check. Each reader queries key50, key10, key99, "", and an absent key nonexistent; results must be value: 5245504c41434544 (REPLACED), tombstone, tombstone, value: 656d7074792d6b65792d76616c7565, absent respectively across all readers.

End-to-end verification (scripts/verify.sh)

bash scripts/verify.sh

Builds and tests all three languages, then runs the cross-test. Final line must be ALL GREEN.

Manual sanity checks

  • memtable new /tmp/m && wc -c /tmp/m → exactly 8 bytes.
  • memtable bulk /tmp/m 1000 && memtable size /tmp/m → size_bytes matches 8 + Σ over N = 0..999 of (9 + len("keyN") + len("valN")).
  • Hexdump the first 16 bytes of any dump and confirm magic + count.
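The expected size for the bulk check can be computed directly instead of by hand. This sketch assumes the key and value naming described for the bulk subcommand (key0..key{N-1}, val0..val{N-1}) and a table that was empty beforehand; `expected_bulk_size` is a hypothetical helper.

```rust
/// Expected dump size after `bulk PATH n` on an empty table, assuming keys
/// "key0".."key{n-1}" and values "val0".."val{n-1}".
pub fn expected_bulk_size(n: u32) -> usize {
    let body: usize = (0..n)
        .map(|i| 9 + format!("key{}", i).len() + format!("val{}", i).len())
        .sum();
    8 + body // 8-byte header plus the per-entry sizes
}
```

For n = 1000 this gives 20 788 bytes: 8 for the header, 9 000 of fixed entry overhead, and 5 890 bytes each for keys and values (3 000 bytes of prefix plus 2 890 digits).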

What broken looks like

| Symptom | Diagnostic |
|---|---|
| decode accepts b"\x00\x00\x00\x00" (no magic check) | Add magic test V8. |
| Two readers print different iter output for the same dump | Either type-byte misplaced, or one language is comparing by string instead of bytes (UTF-8 vs raw). |
| len() differs across langs after the same script | Go's map+sort path lost a duplicate; check overwrite path. |
| Dump grows monotonically after del | Tombstone path is creating a new entry under a different key; check key equality. |
| Random crash in C++ on decode of truncated input | Missing length check before memcpy; bounds-check every read. |

Broader Ideas — db-05 LSM MemTable

The MemTable in this lab is intentionally minimal. Real systems extend it in many directions; this doc maps the design space.

Concurrency-friendly structures

  • Skip list (LevelDB, RocksDB). Single writer + many readers, lock-free reads via memory-order acquire/release. The dominant choice for LSM MemTables.
  • Bw-tree (Hekaton). Lock-free B+ tree using delta records and a mapping table; shines on multi-writer workloads.
  • ART (Adaptive Radix Tree). Cache-friendly trie; very fast point lookups, used by HyPer and DuckDB.
  • Masstree. Trie-of-B+trees; outperforms skip list on long variable keys.

Arena allocation

LevelDB's MemTable allocates all skip-list nodes from a bump-arena. Freeing is O(1) (drop the arena). RocksDB has a configurable Arena and a ConcurrentArena for parallel writes. Real benefit: less fragmentation, and each allocation is little more than a pointer bump. Our lab uses standard allocators because the lesson is the data layout, not the allocator.

Sequence numbers & MVCC

Production LSMs append an 8-byte trailer to every internal key: a 56-bit sequence number packed with an 8-bit type byte. Snapshot reads pick the newest entry whose sequence number is ≤ the snapshot's. db-13 revisits this when we add MVCC; here we collapse to last-write-wins.

Range tombstones

A single tombstone shadows one key. RocksDB's DeleteRange tombstones cover a key range [start, end) and live in a separate auxiliary structure inside the MemTable (RangeDelAggregator). This avoids exploding the MemTable size when bulk-deleting. Adding it would require:

  1. A RangeTombstone struct: (start, end, seqno).
  2. A second sorted container inside MemTable.
  3. get consults both: a key shadowed by an overlapping range tombstone returns Tombstone even if it has a Value entry.
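The three steps above might look like the following. This is a sketch, not RocksDB's design: the field layout, the `(start, seqno)` map key, and the rule that a point write newer than the range tombstone stays visible are all assumptions added for the example.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Entry {
    Value(Vec<u8>),
    Tombstone,
}

/// Hypothetical range tombstone covering keys in [start, end).
#[derive(Clone, Debug)]
pub struct RangeTombstone {
    pub start: Vec<u8>,
    pub end: Vec<u8>,
    pub seqno: u64,
}

#[derive(Default)]
pub struct MemTable {
    /// Point entries, each tagged with the seqno of the write that produced it.
    points: BTreeMap<Vec<u8>, (u64, Entry)>,
    /// Second sorted container: range tombstones keyed by (start, seqno).
    ranges: BTreeMap<(Vec<u8>, u64), RangeTombstone>,
    next_seqno: u64,
}

impl MemTable {
    pub fn new() -> Self {
        Self::default()
    }

    fn bump(&mut self) -> u64 {
        self.next_seqno += 1;
        self.next_seqno
    }

    pub fn put(&mut self, key: &[u8], value: &[u8]) {
        let seqno = self.bump();
        self.points.insert(key.to_vec(), (seqno, Entry::Value(value.to_vec())));
    }

    pub fn delete_range(&mut self, start: &[u8], end: &[u8]) {
        let seqno = self.bump();
        let rt = RangeTombstone { start: start.to_vec(), end: end.to_vec(), seqno };
        self.ranges.insert((start.to_vec(), seqno), rt);
    }

    pub fn get(&self, key: &[u8]) -> Option<Entry> {
        // Newest range tombstone with start <= key < end, if any.
        let covering = self
            .ranges
            .range(..=(key.to_vec(), u64::MAX))
            .filter(|(_, rt)| key < rt.end.as_slice())
            .map(|(_, rt)| rt.seqno)
            .max();
        match (self.points.get(key), covering) {
            // A range tombstone newer than the point entry shadows it.
            (Some((s, _)), Some(r)) if r > *s => Some(Entry::Tombstone),
            (Some((_, e)), _) => Some(e.clone()),
            (None, Some(_)) => Some(Entry::Tombstone),
            (None, None) => None,
        }
    }
}
```

The lookup scans every range tombstone that starts at or before the key, which is linear in the worst case; a production design would use an interval-aware structure instead.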

Multiple MemTables (active + immutable list)

Production engines keep one active MemTable plus a list of immutable MemTables awaiting flush. Reads consult [active, ...immutables, L0, L1, ...] in order. Writers swap atomically (active → immutable + new empty active) when the size cap is hit. This decouples flush latency from write latency.
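A single-threaded sketch of the active/immutable split is shown below. It omits tombstones, the WAL, and the atomic swap that a concurrent engine needs; `WriteBuffer`, `cap_bytes`, and `pop_for_flush` are hypothetical names used only to illustrate the rotation.

```rust
use std::collections::{BTreeMap, VecDeque};

type Table = BTreeMap<Vec<u8>, Vec<u8>>;

/// Hypothetical write path with one active table and a queue of frozen ones.
pub struct WriteBuffer {
    active: Table,
    active_bytes: usize,
    /// Newest frozen table at the front, oldest (next to flush) at the back.
    immutables: VecDeque<Table>,
    cap_bytes: usize,
}

impl WriteBuffer {
    pub fn new(cap_bytes: usize) -> Self {
        WriteBuffer { active: Table::new(), active_bytes: 0, immutables: VecDeque::new(), cap_bytes }
    }

    pub fn put(&mut self, key: &[u8], value: &[u8]) {
        // Same accounting as the dump: 9 bytes of overhead plus key and value.
        if let Some(old) = self.active.insert(key.to_vec(), value.to_vec()) {
            self.active_bytes -= 9 + key.len() + old.len();
        }
        self.active_bytes += 9 + key.len() + value.len();
        if self.active_bytes >= self.cap_bytes {
            self.freeze();
        }
    }

    /// Swap in a fresh active table; the old one becomes immutable.
    fn freeze(&mut self) {
        let frozen = std::mem::take(&mut self.active);
        self.immutables.push_front(frozen);
        self.active_bytes = 0;
    }

    /// Consult the active table first, then frozen tables newest to oldest.
    pub fn get(&self, key: &[u8]) -> Option<&[u8]> {
        std::iter::once(&self.active)
            .chain(self.immutables.iter())
            .find_map(|t| t.get(key))
            .map(|v| v.as_slice())
    }

    /// Hand the oldest frozen table to the flusher, if any.
    pub fn pop_for_flush(&mut self) -> Option<Table> {
        self.immutables.pop_back()
    }

    pub fn immutable_count(&self) -> usize {
        self.immutables.len()
    }
}
```

Writers never wait for a flush here: hitting the cap only costs a swap, and the frozen table stays readable until the flusher removes it.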

Write amplification interplay

The MemTable size cap (write_buffer_size) is the first knob in the LSM write amplification dial:

  • Larger MemTable → fewer, bigger L0 SSTables → less L0 compaction → lower write amp but slower recovery and more RAM.
  • Smaller MemTable → more L0 SSTables → more compaction work → higher write amp but fast recovery.

RocksDB and Cassandra default in the range 64–256 MiB; LevelDB defaults to 4 MiB.

Persistent MemTables (PMEM)

Intel Optane / CXL persistent memory blurs the WAL+MemTable boundary: if the MemTable itself lives in persistent memory, the WAL can become unnecessary. NoveLSM (USENIX ATC 2018) and SLM-DB (FAST 2019) explore this design; FloDB (EuroSys 2017) is related work on restructuring the in-memory layer.

Encryption

Cassandra and RocksDB optionally encrypt at-rest data including the MemTable's flushed SSTables. The MemTable itself is in RAM and inherits process-memory protection. Encrypting in-memory pages requires hardware support (SGX, AMD SEV).

Compression of in-memory entries

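For very long values, an engine could compress values inside the MemTable (for example with LZ4 or ZSTD), trading CPU for memory; this is useful when RAM is the limit. RocksDB's stock MemTable implementations store entries uncompressed and apply compression when blocks are written to SSTables.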

Skip-list level distribution

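Pugh's original skip list uses a geometric level distribution with promotion probability p (p = 1/2 in the basic presentation), with the maximum level sized at about log₁/ₚ n. LevelDB sets max height = 12 and branching = 4 (p = 1/4); RocksDB's skip list uses the same defaults (max height 12, branching 4), both configurable. A lower branching factor gives taller towers: more pointer memory per node, but fewer nodes to step over at each level.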
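The height rule is short enough to show in full. The sketch below uses LevelDB's constants (max height 12, branching 4) with a small xorshift generator so it needs no external crates; the generator and the names `XorShift64` and `random_height` are assumptions for the example, not LevelDB's actual code.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crates.
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // xorshift must not start from zero, so substitute a fixed constant.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    pub fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

pub const MAX_HEIGHT: usize = 12;
pub const BRANCHING: u64 = 4;

/// Every node has height 1; each extra level is taken with
/// probability 1/BRANCHING, capped at MAX_HEIGHT.
pub fn random_height(rng: &mut XorShift64) -> usize {
    let mut height = 1;
    while height < MAX_HEIGHT && (rng.next() >> 32) % BRANCHING == 0 {
        height += 1;
    }
    height
}
```

With branching 4, roughly three quarters of nodes stay at height 1, so the expected number of forward pointers per node is about 4/3.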

Adversarial concerns

  • Memory amplification via tombstones. A flood of deletes can make the MemTable hold many entries with no live data; eventually all tombstones must propagate to SSTables and may take generations of compaction to GC.
  • Skew-induced flush storms. A hot key prefix can keep one MemTable bucket pinned while others empty; with hash-partitioned MemTables (HashSkipList) this is pronounced.

Beyond LSM

  • B-epsilon trees (TokuDB / Percona) batch writes inside internal B+ tree nodes; no separate MemTable.
  • Anti-caching (H-Store, the research predecessor of VoltDB) keeps the working set in memory and evicts cold rows to disk; inverts the LSM model.
  • WiscKey decouples keys (LSM) from values (separate log) to slash write amplification for large values.

Step 01 — Sorted map + Entry type

Build the in-memory MemTable: a sorted associative container from byte-key to an Entry that is either Value(bytes) or Tombstone. Implement put, delete, get, iter, and len / size_bytes in all three languages with the same iteration order (byte-lex).

Why this first

The MemTable's iteration order is the contract that the on-disk format and the SSTable flush both depend on. If three languages disagree on order, every later step falls apart. So this step is a one-language-after-the-other implementation of the same BTreeMap-equivalent, with a shared unit test that inserts a permutation and checks the order.

Entry type

// Rust
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Entry {
    Value(Vec<u8>),
    Tombstone,
}
// Go
type EntryType uint8
const (
    EntryValue EntryType = 0
    EntryTombstone EntryType = 1
)
type Entry struct {
    Type  EntryType
    Value []byte // empty if Tombstone
}
// C++
namespace dse::memtable {
    enum class EntryType : std::uint8_t { Value = 0, Tombstone = 1 };
    struct Entry {
        EntryType type;
        std::vector<std::uint8_t> value; // empty if Tombstone
    };
}

The type-byte numbering (0 = Value, 1 = Tombstone) is part of the on-disk contract — don't reorder it.

Container choice

// Rust — Vec<u8>'s Ord is byte-lex; BTreeMap iterates in key order
use std::collections::BTreeMap;
pub struct MemTable {
    map: BTreeMap<Vec<u8>, Entry>,
    bytes: usize,
}
// Go — unordered map; sort keys on iteration / encode
type MemTable struct {
    m     map[string]Entry
    bytes int
}

func (t *MemTable) sortedKeys() []string {
    keys := make([]string, 0, len(t.m))
    for k := range t.m {
        keys = append(keys, k)
    }
    sort.Strings(keys) // byte-lex on string is the same as on []byte
    return keys
}
// C++ — std::map's comparator is operator< on vector<uint8_t>, which is lex
class MemTable {
    std::map<std::vector<std::uint8_t>, Entry> map_;
    std::size_t bytes_ = 0;
};

put / delete

pub fn put(&mut self, key: &[u8], value: &[u8]) {
    // Subtract the old entry's contribution (0 if the key is new), then add the new one.
    self.bytes -= self.entry_bytes(key);
    self.map.insert(key.to_vec(), Entry::Value(value.to_vec()));
    self.bytes += self.entry_bytes(key);
}

pub fn delete(&mut self, key: &[u8]) {
    // The key is kept and its entry replaced by a tombstone.
    self.bytes -= self.entry_bytes(key);
    self.map.insert(key.to_vec(), Entry::Tombstone);
    self.bytes += self.entry_bytes(key);
}

// Encoded size of the entry currently stored under `key` (0 if absent).
fn entry_bytes(&self, key: &[u8]) -> usize {
    match self.map.get(key) {
        None => 0,
        Some(Entry::Value(v)) => 9 + key.len() + v.len(),
        Some(Entry::Tombstone) => 9 + key.len(),
    }
}

Go and C++ use the same accounting trick: subtract the old entry's contribution, update, add the new contribution.

iter

pub fn iter(&self) -> impl Iterator<Item = (&[u8], &Entry)> {
    self.map.iter().map(|(k, e)| (k.as_slice(), e))
}
func (t *MemTable) Iter() []KeyEntry {
    out := make([]KeyEntry, 0, len(t.m))
    for _, k := range t.sortedKeys() {
        out = append(out, KeyEntry{Key: []byte(k), Entry: t.m[k]})
    }
    return out
}
const std::map<std::vector<std::uint8_t>, Entry>& Iter() const noexcept {
    return map_;
}

Test — order determinism

Insert this permutation in each language and assert iteration yields the keys in the canonical sorted order:

inputs (insert order):  ["b", "a", "", "\x00\x00", "ab", "\x00"]
expected iter order:    ["", "\x00", "\x00\x00", "a", "ab", "b"]

This catches:

  • Go forgetting to sort.
  • C++ building keys through C-string APIs (const char* constructors, strlen, strcmp), where '\0' ends the string and truncates binary keys.
  • Anyone using a hash map.

What to verify before moving on

  • put then get returns the value just written.
  • delete then get returns Tombstone (not absent).
  • Overwriting a key keeps len() at 1.
  • The permutation test above passes.
  • size_bytes() increases by exactly 9 + klen + vlen for each new key; on an overwrite it changes only by the difference between the new and old value lengths (a tombstone counts as vlen = 0).

Step 02 — Encode / Decode the dump

Serialize the MemTable to a byte-identical on-disk layout and parse it back.

Layout (recap from analysis.md)

  magic   "MMT1"            (4 bytes)
  count   u32 LE            (4 bytes)
  repeat count times, in ascending key order:
      klen   u32 LE
      vlen   u32 LE         (0 for tombstone)
      type   u8             (0 = Value, 1 = Tombstone)
      key    klen bytes
      value  vlen bytes

Rust

pub fn encode(&self) -> Vec<u8> {
    let mut out = Vec::with_capacity(self.size_bytes());
    out.extend_from_slice(b"MMT1");
    out.extend_from_slice(&(self.map.len() as u32).to_le_bytes());
    // BTreeMap iterates in ascending key order, which is the on-disk order.
    for (k, e) in &self.map {
        let (vlen, t, v) = match e {
            Entry::Value(v) => (v.len() as u32, 0u8, v.as_slice()),
            Entry::Tombstone => (0, 1, &[][..]),
        };
        out.extend_from_slice(&(k.len() as u32).to_le_bytes());
        out.extend_from_slice(&vlen.to_le_bytes());
        out.push(t);
        out.extend_from_slice(k);
        out.extend_from_slice(v);
    }
    out
}

pub fn decode(bytes: &[u8]) -> Result<Self, Error> {
    if bytes.len() < 8 { return Err(Error::Short); }
    if &bytes[..4] != b"MMT1" { return Err(Error::BadMagic); }
    let count = u32::from_le_bytes(bytes[4..8].try_into().unwrap()) as usize;
    let mut p = 8usize;
    let mut t = MemTable::new();
    let mut prev: Option<Vec<u8>> = None;
    for _ in 0..count {
        // Fixed 9-byte entry header: klen, vlen, type.
        if bytes.len() - p < 9 { return Err(Error::Short); }
        let klen = u32::from_le_bytes(bytes[p..p+4].try_into().unwrap()) as usize;
        let vlen = u32::from_le_bytes(bytes[p+4..p+8].try_into().unwrap()) as usize;
        let ty = bytes[p+8];
        p += 9;
        // Compare against the remaining length so the check cannot
        // overflow usize on 32-bit targets.
        let remaining = bytes.len() - p;
        if klen > remaining || vlen > remaining - klen { return Err(Error::Short); }
        let key = bytes[p..p+klen].to_vec();
        p += klen;
        let val = bytes[p..p+vlen].to_vec();
        p += vlen;
        // Keys must be strictly ascending (this also rejects duplicates).
        if let Some(ref pk) = prev {
            if key.as_slice() <= pk.as_slice() { return Err(Error::Unsorted); }
        }
        let entry = match ty {
            0 => { Entry::Value(val) }
            1 => { if vlen != 0 { return Err(Error::BadTombstone); } Entry::Tombstone }
            _ => return Err(Error::BadType),
        };
        prev = Some(key.clone());
        t.insert_raw(key, entry);
    }
    // Anything left after the declared entries is corruption.
    if p != bytes.len() { return Err(Error::Trailing); }
    Ok(t)
}

Go

func (t *MemTable) Encode() []byte {
    out := make([]byte, 0, t.SizeBytes())
    out = append(out, 'M', 'M', 'T', '1')
    out = binary.LittleEndian.AppendUint32(out, uint32(len(t.m)))
    for _, k := range t.sortedKeys() {
        e := t.m[k]
        out = binary.LittleEndian.AppendUint32(out, uint32(len(k)))
        out = binary.LittleEndian.AppendUint32(out, uint32(len(e.Value)))
        out = append(out, byte(e.Type))
        out = append(out, k...)
        out = append(out, e.Value...)
    }
    return out
}

Decode mirrors the Rust shape: read header, then loop reading klen, vlen, type, key, value, validating ascending key order and rejecting trailing bytes.

C++

std::vector<std::uint8_t> MemTable::Encode() const {
    std::vector<std::uint8_t> out;
    out.reserve(SizeBytes());
    static constexpr std::uint8_t magic[4] = {'M','M','T','1'};
    out.insert(out.end(), magic, magic + 4);
    PutU32LE(out, static_cast<std::uint32_t>(map_.size()));
    for (auto const& [k, e] : map_) {
        PutU32LE(out, static_cast<std::uint32_t>(k.size()));
        PutU32LE(out, static_cast<std::uint32_t>(e.value.size()));
        out.push_back(static_cast<std::uint8_t>(e.type));
        out.insert(out.end(), k.begin(), k.end());
        out.insert(out.end(), e.value.begin(), e.value.end());
    }
    return out;
}

What the decoder must reject

| Input | Why it must fail |
|---|---|
| < 8 bytes | header truncated |
| magic XXXX | bad format |
| count says 5 but only 3 entries fit | truncated body |
| type byte 2 | unknown type |
| tombstone with vlen != 0 | malformed |
| keys not strictly ascending | violates order invariant |
| trailing bytes after last entry | corruption |

How to spot-check

After encoding a 2-entry MemTable (alpha=first, beta tombstoned):

xxd /tmp/m.bin | head
00000000: 4d4d 5431 0200 0000 0500 0000 0500 0000  MMT1............
00000010: 0061 6c70 6861 6669 7273 7404 0000 0000  .alphafirst.....
00000020: 0000 0001 6265 7461                      ....beta

Three things to verify by eye:

  1. 4d4d5431 = MMT1.
  2. 0200 0000 = count of 2 (LE).
  3. The tombstone's vlen is 0000 0000 and the byte before its key is 01.

Test V6 — round-trip 50 entries

Mix puts and deletes:

#[test]
fn encode_decode_round_trip() {
    let mut t = MemTable::new();
    for i in 0..50 {
        let key = format!("key{i}");
        if i % 5 == 0 {
            t.delete(key.as_bytes());
        } else {
            t.put(key.as_bytes(), format!("val{i}").as_bytes());
        }
    }
    let encoded = t.encode();
    let roundtrip = MemTable::decode(&encoded).unwrap();
    assert_eq!(roundtrip.len(), t.len());
    for (k, e) in t.iter() {
        assert_eq!(roundtrip.get(k), Some(e));
    }
    // Idempotence: decoding and re-encoding must reproduce the same bytes.
    assert_eq!(roundtrip.encode(), encoded);
}

The final assertion is the idempotence check: decoding and re-encoding produces the same bytes. If it doesn't, we have non-determinism in iteration order, which will also break cross-language interop.

Step 03 — CLI + cross-language interop

Wrap the library in a uniform CLI and prove that all three implementations produce byte-identical dumps for the same scripted scenario.

The memtable binary

Every language exposes the same subcommands so the cross-test can drive them uniformly:

memtable new    PATH
memtable put    PATH KEY VALUE
memtable del    PATH KEY
memtable get    PATH KEY
memtable iter   PATH
memtable bulk   PATH N
memtable size   PATH
  • KEY and VALUE are passed as raw command-line strings. They may contain printable bytes; for testing we stick to ASCII to avoid shell quoting issues.
  • iter and get print hex (lowercase, no separator) so output is shell-safe.

Output formats

# iter
V <hex-key> <hex-value>
T <hex-key>

# get
value: <hex-value>
tombstone
absent

# size
entries=<N> size_bytes=<B>

The scripted scenario

scripts/cross_test.sh drives every language through this sequence:

new                                          # 8 bytes, empty
bulk 100                                     # 100 entries key0..key99 / val0..val99
put  key50  REPLACED                         # overwrite
del  key10                                   # tombstone
put  ""     empty-key-value                  # empty key as a valid key
del  key99                                   # tombstone at the tail

Then it dumps rust.bin, go.bin, cpp.bin and asserts:

shasum -a 256 rust.bin go.bin cpp.bin
# all three hashes must be identical

3×3 reader matrix

For every writer × reader combination, the script runs

$READER iter $WRITER.bin > out.${reader}.${writer}.txt

and diffs pairs of outputs. All nine outputs must agree byte-for-byte.

Why a bulk subcommand

Running 100 separate memtable put PATH key0 val0 … invocations would (a) thrash the disk and (b) test the CLI's argument parsing more than the data structure. bulk exists so the cross-test can build a non-trivial table in one process per language.

Spot-check get results

After the scenario the script also runs

get key50         # expect 'value: 5245504c41434544'   (REPLACED in hex)
get key10         # expect 'tombstone'
get key99         # expect 'tombstone'
get ""            # expect 'value: 656d7074792d6b65792d76616c7565'  (empty-key-value)
get nonexistent   # expect 'absent'

across all three readers.

Failure messages worth designing for

$ memtable get /tmp/m bogus
absent

$ memtable get /tmp/no-such-file foo
error: read /tmp/no-such-file: No such file or directory

$ memtable get /tmp/garbage.bin foo
error: bad magic

A consistent error vocabulary across languages makes the cross-test's grep patterns simpler.

Tying it together

scripts/verify.sh runs:

  1. Rust tests (cargo test --release).
  2. Go tests (go test ./...).
  3. C++ tests (cmake -S . -B build && cmake --build build && ctest --test-dir build).
  4. The cross-language script.

Final stdout must end with ALL GREEN.

SSTable Format

1. What Is It

A Sorted String Table (SSTable) is an immutable on-disk file holding key/value entries in byte-lex key order, organised into fixed-size blocks with an index block that maps each block's first key to its byte range inside the file, and a fixed-size footer that locates the index block.

The format in this lab:

+--------------------+   0
| data block 0       |
+--------------------+
| data block 1       |
+--------------------+
| ...                |
+--------------------+
| data block N-1     |
+--------------------+   index_offset
| index block        |
+--------------------+   file_size - 32
| footer (32 bytes)  |
+--------------------+   file_size

The footer always lives in the last 32 bytes and ends with the magic SST1\0\0\0\0, so any reader can validate the file with one pread of the tail and then pread the index block, and only then the relevant data block.

2. Why It Matters

  • Write-once, read-many. Each SSTable is written sequentially and then treated as read-only. That eliminates most concurrency hazards: lookups, range scans, and compactions can all share a single immutable file.
  • Bounded read amplification. A point lookup is footer → index → one data block. With a 4 KB block and a 16-byte average entry, ≤256 keys are scanned per lookup, regardless of file size.
  • Predictable I/O. Blocks are aligned write units; the OS page cache can pin hot blocks. Tail latency is dominated by exactly two I/Os per miss (index + data).
  • Foundation for LSM. Compaction merges multiple immutable SSTables into new immutable SSTables. The format is what makes "immutable + sorted + indexed" a usable storage primitive.

3. How It Works

3.1 Data block

A data block is a self-describing run of entries.

[count: u32 LE]
repeat count times (keys ascending byte-lex within the block):
  [klen: u32 LE][vlen: u32 LE][type: u8][key bytes][value bytes]

The writer flushes a block when its accumulated size would exceed a target (default 4096 bytes). The very first key of each block is the index key for that block.

3.2 Index block

[count: u32 LE]
repeat count times:
  [klen: u32 LE][offset: u64 LE][size: u64 LE][first-key bytes]

offset and size locate the data block inside the file. Index entries are listed in ascending block order, which is the same as ascending first-key order.

3.3 Footer

[index_offset: u64 LE]
[index_size:   u64 LE]
[num_blocks:   u64 LE]
[magic:        "SST1\0\0\0\0" (8 bytes)]

The fixed size makes the tail trivially locatable: pread(fd, buf, 32, file_size - 32).

3.4 Point lookup

  1. Read footer; verify magic.
  2. Read index block (index_offset..index_offset+index_size).
  3. Binary-search the index for the rightmost index entry whose first key ≤ target. That entry's block is the only one that can contain the target.
  4. Read that block; linear-scan within it.

A miss in step 3 (target < first entry's key) means the key is absent without reading any data block.
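Step 3 is a floor search, not an equality search. One possible sketch, using the standard library's slice partition_point; the function name floor_index and the sample keys are illustrative assumptions, not the lab's API:

```rust
/// Index of the rightmost first key that is <= target, or None when the
/// target sorts before every first key. `first_keys` must be sorted ascending.
/// (Hypothetical helper for illustration.)
fn floor_index(first_keys: &[&[u8]], target: &[u8]) -> Option<usize> {
    // partition_point returns how many leading keys satisfy `k <= target`.
    let n = first_keys.partition_point(|k| *k <= target);
    n.checked_sub(1) // zero qualifying keys means a fast miss
}

fn main() {
    let first_keys: [&[u8]; 3] = [b"apple", b"mango", b"tomato"];
    println!("{:?}", floor_index(&first_keys, b"banana"));   // Some(0): inside block 0's range
    println!("{:?}", floor_index(&first_keys, b"mango"));    // Some(1): exact first key
    println!("{:?}", floor_index(&first_keys, b"aardvark")); // None: before every block
}
```

A Some(i) result only says which block could hold the key; the key may still be absent from that block, which the linear scan in step 4 decides.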

4. Core Terminology

Term            Definition
SSTable         Immutable sorted on-disk K/V file.
Block           Contiguous run of entries inside the SSTable; the I/O and indexing unit.
Data block      Block containing user K/V entries.
Index block     Block mapping each data block's first key to (offset, size).
Footer          Fixed-size tail (32 B) locating the index block; ends with a magic.
Magic           Sentinel byte pattern (SST1 here) used to validate file identity.
Index key       The first key of a data block, copied into the index entry.
Block boundary  The byte position where one data block ends and the next begins.
Restart point   (Not used here; LevelDB-style intra-block delta-encoding marker.)
Tombstone       Entry whose type=1 records that a key has been logically deleted.

5. Mental Models

  • Phone book. Data blocks are pages; the index is the alphabetical tabs on the side. The footer is the spine label saying "Volume 3 of 3".
  • Skiplist with one level. The index is a single coarse "level" above the sorted data; binary search on the index replaces a multi-level skiplist traversal.
  • Two-tier B+tree. Conceptually an SSTable is a 2-level B+tree whose leaves are data blocks and whose root is the index block — but built sequentially and frozen.

6. Common Misconceptions

  • "You need to scan the whole file to find a key." No — one index lookup pins a single block.
  • "The index must be at the start so you can read it first." No — the footer pointer makes the index location flexible, and writing the index last avoids buffering all keys before any data is flushed.
  • "Block size = block contents size." The on-disk block includes its own count header; the writer tracks an estimate of accumulated bytes so blocks land roughly on the target.
  • "Tombstones can be dropped at write time." Not safely — a tombstone must survive until no older SSTable can shadow it (handled by compaction in db-07).
  • "Binary search needs fixed-size index entries." The index entries here are variable-length, but the index block itself is small and fully loaded into RAM, where any search structure is cheap.

7. Interview Talking Points

  • Why is the footer at the end? ("So the writer can stream data blocks then index without two passes; one pread of the tail finds everything.")
  • What changes if you target 64 KB blocks instead of 4 KB? (Fewer index entries → smaller index → faster directory lookups; larger read amplification per miss; better compression ratios.)
  • How does this format become durable? (fsync after the footer is written, and a parent directory fsync so the dirent is recoverable. Without it a crash can leave the magic visible but data missing.)
  • What is bsearch looking for inside the index? (The floor — the largest first-key ≤ target — not equality.)
  • What stops a corrupt footer from poisoning the reader? (Magic check + length plausibility checks. Real systems add CRCs per block.)
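The durability answer above (fsync the file, then fsync the parent directory) can be sketched with the Rust standard library. This is a minimal illustration that assumes a Unix-like OS, where a directory can be opened and synced; the function name write_durably and the file name are hypothetical.

```rust
use std::fs::File;
use std::io::{self, Write};
use std::path::Path;

/// Write `bytes` to `path`, then fsync the file and its parent directory.
/// Assumption: a Unix-like OS where opening and syncing a directory works.
fn write_durably(path: &Path, bytes: &[u8]) -> io::Result<()> {
    let mut f = File::create(path)?;
    f.write_all(bytes)?;
    f.sync_all()?; // push file data and metadata to stable storage

    // Sync the parent directory so the new directory entry survives a crash.
    let parent = match path.parent() {
        Some(p) if !p.as_os_str().is_empty() => p,
        _ => Path::new("."),
    };
    File::open(parent)?.sync_all()
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("durable-example.sst");
    write_durably(&path, b"SST1\0\0\0\0")?;
    println!("wrote {} bytes", std::fs::metadata(&path)?.len());
    std::fs::remove_file(&path)
}
```

The order matters: syncing the directory before the file contents are stable would make the name durable while the bytes behind it are not.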

8. Connections to Other Labs

  • db-05 MemTable supplies the sorted, in-memory K/V stream that this writer drains into blocks.
  • db-04 Bloom Filters can be attached per SSTable to skip the index lookup on negative queries (added in db-08 / db-09).
  • db-07 LSM Compaction consumes many SSTables and produces new ones using exactly this format.
  • db-08 Block Cache and Iterators caches the parsed data block rather than re-decoding on each lookup, and turns intra-block scans into iterators.
  • db-09 LevelDB Complete stitches MemTable + WAL + SSTable + compaction + bloom into a working engine.

References — SSTable Format

Papers

  • O'Neil, P., Cheng, E., Gawlick, D., O'Neil, E. "The Log-Structured Merge-Tree (LSM-Tree)." Acta Informatica 33(4), 1996. — Original description of the immutable run / multi-level merge architecture.
  • Chang, F. et al. "Bigtable: A Distributed Storage System for Structured Data." OSDI 2006. — Introduces the SSTable term and the blocks-plus-index layout this lab mirrors.

Open-source implementations

  • LevelDB table/format.h, table/table_builder.cc, table/block_builder.cc, table/block.cc — canonical reference for this lab. The data block format here is the LevelDB block format with restart-point compression removed for clarity.
  • RocksDB table/block_based/block_based_table_builder.cc — adds bloom-filter blocks, compression, and partitioned indices on top of the same skeleton.
  • CockroachDB Pebble sstable/writer.go and sstable/reader.go — Go implementation in idiomatic style.

Articles

Analysis — SSTable Format

Problem

Take a sorted stream of K/V (and tombstone) entries — exactly what db-05 produces — and persist it as an immutable, randomly-readable file:

  • writing is one sequential pass (no re-reads, no buffering all keys);
  • a point lookup costs O(log N) on the index plus one block read;
  • the file is self-describing: a reader can validate and navigate it using only the file itself.

Constraints

  • 4 KB target data-block size (close to a page; tunable).
  • Little-endian integers throughout (matches db-03 / db-05).
  • No per-block CRCs in this lab — added in db-21 ("Storage Engine Advanced"). The footer magic is the only integrity gate.
  • No compression, no delta-encoded keys: the goal is a format simple enough to compare byte-for-byte across three languages.
  • Cross-language interop: Rust, Go, and C++ MUST emit byte-identical SSTables for the same input MemTable.

Design

Stream-once writer

sst_writer.add(key, type, value) -> writes into current data block buffer
sst_writer.finish() -> flushes the current block, writes index, writes footer

The writer accumulates one block in memory at a time. When adding an entry would push the encoded block size past 4096 bytes, the current block is flushed and a new one started. The first key of every block is captured into an IndexEntry { key, offset, size }.

Index sizing

Index entries are ~ (4 + 8 + 8 + k̄) = 20 + k̄ bytes; for k̄ = 16 and ~250 entries per 4 KB block, a 1 GB SSTable carries ~262 144 blocks × 36 B ≈ 9 MB of index — small enough to keep in RAM per open SSTable.
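The estimate above is plain arithmetic and easy to re-run for other block sizes. A small sketch, assuming the whole file consists of data blocks (a slight overestimate) and using a hypothetical helper named index_bytes:

```rust
/// Rough index size for one SSTable (hypothetical helper for sizing estimates).
/// Treats the whole file as data blocks, which slightly overestimates the block count.
fn index_bytes(file_bytes: u64, block_bytes: u64, avg_first_key_len: u64) -> u64 {
    let blocks = (file_bytes + block_bytes - 1) / block_bytes; // round up
    // 4-byte count header, then klen u32 + offset u64 + size u64 + key per entry.
    4 + blocks * (4 + 8 + 8 + avg_first_key_len)
}

fn main() {
    let one_gib = 1u64 << 30;
    for block in [4096u64, 65536] {
        let bytes = index_bytes(one_gib, block, 16);
        println!("block={} index={} bytes", block, bytes);
    }
}
```

For 4 KB blocks and 16-byte keys this gives about 9 MiB per GiB of data; moving to 64 KB blocks divides the index by 16.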

Lookup

fn get(key) -> Option<Entry>:
    footer = read_tail(32)
    assert footer.magic == "SST1\0\0\0\0"
    index = read(footer.index_offset, footer.index_size)
    blk = bsearch_floor(index, key)?                # None => below smallest
    block = read(blk.offset, blk.size)
    return linear_scan(block, key)

bsearch_floor is the rightmost index entry whose first key ≤ target. Returning None (target precedes the smallest first-key) is a fast miss without reading any data block.

Per-language container choice

Language  Writer buffer                    Index repr
Rust      Vec<u8> for the current block    Vec<IndexEntry>
Go        []byte                           []IndexEntry
C++       std::vector<uint8_t>             std::vector<IndexEntry>

IndexEntry is (Vec<u8> key, u64 offset, u64 size) in all three.

Build-from-memtable bridge

For cross-test friendliness, the writer's input source is a decoded MemTable dump (the output of db-05 encode). The CLI command build reads IN.mt, iterates in sorted order, and emits OUT.sst.

What could break

  • Block boundary drift. If two implementations disagree on when to flush a block (e.g. > 4096 vs >= 4096), the data blocks land at different offsets, the index differs, and the footer hashes differ. We pin the rule: flush when current_block_size + next_entry_size > 4096 AND current_block_size > 0 (that is, the block already holds at least one entry).

  • Index encoding for the very first block. Its first key may be the empty string ""; the index entry then has klen=0. The reader must still treat it as the floor for any non-empty target.
  • Footer alignment. Anything other than exactly 32 bytes after the index block invalidates the magic offset.
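The block-boundary drift described in the list above is easy to reproduce on paper. The sketch below is a simplified model (it counts blocks only, using encoded entry sizes) and the function name count_blocks is a hypothetical one; it shows how the pinned `>` rule and a divergent `>=` rule split the same input differently.

```rust
/// Count data blocks for a stream of encoded entry sizes.
/// `strict == true` models the pinned rule (flush when size + next > target);
/// `strict == false` models the divergent `>=` variant.
fn count_blocks(entry_sizes: &[usize], target: usize, strict: bool) -> usize {
    const HEADER: usize = 4; // per-block u32 entry count
    let mut blocks = 0;
    let mut size = HEADER;
    let mut entries = 0;
    for &sz in entry_sizes {
        let over = if strict { size + sz > target } else { size + sz >= target };
        if entries > 0 && over {
            blocks += 1; // flush the running block
            size = HEADER;
            entries = 0;
        }
        size += sz;
        entries += 1;
    }
    if entries > 0 {
        blocks += 1; // final partial block
    }
    blocks
}

fn main() {
    let sizes = [30, 30, 30];
    println!(">  rule: {} blocks", count_blocks(&sizes, 64, true));  // 2 blocks
    println!(">= rule: {} blocks", count_blocks(&sizes, 64, false)); // 3 blocks
}
```

One extra block shifts every later offset, so the index and footer differ as well, which is why a single-character disagreement shows up as a hash mismatch.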

Execution — SSTable Format

Library API (uniform across Rust / Go / C++)

struct Entry { type: 0|1; value: bytes }     // tombstone => value == empty
struct IndexEntry { key: bytes; offset: u64; size: u64 }
struct Footer { index_offset: u64; index_size: u64; num_blocks: u64; magic: "SST1\0\0\0\0" }

const BLOCK_TARGET: usize = 4096
const FOOTER_LEN:   usize = 32
const MAGIC:        &[u8; 8] = b"SST1\0\0\0\0"

// ---- writer ----
SstWriter::new(target_block_size = BLOCK_TARGET)
SstWriter::add(&mut self, key: &[u8], entry: Entry) -> Result<(), Error>   // keys MUST be strictly ascending, else Unsorted
SstWriter::finish(self) -> Vec<u8>                    // consumes the writer; returns full SSTable bytes

// ---- reader ----
SstReader::open(bytes: &[u8]) -> Result<Self, Error>
SstReader::len(&self) -> usize                         // num entries
SstReader::num_blocks(&self) -> usize
SstReader::get(&self, key: &[u8]) -> Option<Entry>     // None only if absent; a tombstone is returned, not skipped
SstReader::iter(&self) -> impl Iterator<Item=(&[u8], Entry)>  // full file scan

Error variants: Short, BadMagic, BadBlock, Unsorted, BadTombstone, BadType, IndexOutOfRange.

CLI

The binary is named sstable in every language and dispatches on the first arg:

sstable build  IN.mt OUT.sst         # read MemTable dump, write SSTable
sstable footer FILE.sst              # print: index_offset=... index_size=... num_blocks=... magic_ok=...
sstable get    FILE.sst KEY          # prints: value: <hex> | tombstone | absent
sstable iter   FILE.sst              # prints lines: V <hex-key> <hex-value> | T <hex-key>
sstable size   FILE.sst              # prints: file_bytes=B entries=N num_blocks=K

Output formats match db-05 deliberately so the same cross-test helpers (hex iter, value:/tombstone/absent get) apply.

Worked example

Given memtable bulk M.mt 100 && memtable put M.mt key50 REPLACED && memtable del M.mt key10, calling sstable build M.mt OUT.sst does:

  1. Decode M.mt (MemTable format from db-05).
  2. Iterate in sorted order; for each entry, call writer.add.
  3. The writer accumulates entries into a 4096-byte data-block buffer. When the next entry would overflow, it flushes the buffer:
    • records IndexEntry { key = first_key_of_block, offset, size },
    • appends the encoded block to the output stream,
    • starts a fresh buffer whose first entry is the one that would have overflowed.
  4. After the last add, finish flushes the final block, then writes the index block, then a 32-byte footer ending in SST1\0\0\0\0.

The output file is then self-validating: sstable footer OUT.sst prints the footer values, sstable iter OUT.sst reproduces every entry in sorted order, and sstable get OUT.sst key50 returns value: 5245504c41434544.

Observation — SSTable Format

Smallest possible SSTable

Build from an empty MemTable: zero entries, zero data blocks, an empty index, and just the footer.

file size: 0 + 4 (index count=0) + 32 (footer) = 36 bytes

Hex (annotated):

offset
0000:  00 00 00 00                                          # index block: count=0
0004:  00 00 00 00 00 00 00 00   index_offset = 0
000c:  04 00 00 00 00 00 00 00   index_size   = 4
0014:  00 00 00 00 00 00 00 00   num_blocks   = 0
001c:  53 53 54 31 00 00 00 00   magic        = "SST1\0\0\0\0"

File-size formula

For a build with N entries spread across K data blocks where the sum of key sizes is Σk and the sum of value sizes (only for non-tombstone entries) is Σv:

data_bytes  = Σ_blocks ( 4 + Σ_entries_in_block (9 + k + v) )
            = 4·K + N·9 + Σk + Σv
index_bytes = 4 + Σ_blocks ( 4 + 8 + 8 + first_key_len )
            = 4 + K·20 + Σ_block_first_key_lens
file_bytes  = data_bytes + index_bytes + 32

(The 4-byte per-block header is the entry count. The 20-byte per-index-entry overhead is klen u32 + offset u64 + size u64.)
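The formula can be checked mechanically. A minimal sketch, assuming each block is described only by the (key length, value length) pairs of its entries; the function name file_bytes mirrors the formula and is not part of the lab's API:

```rust
/// Total SSTable size from the formula above. Each inner slice is one data
/// block in file order; each pair is (key length, value length), with value
/// length 0 for tombstones. (Hypothetical helper for illustration.)
fn file_bytes(blocks: &[&[(usize, usize)]]) -> usize {
    const FOOTER: usize = 32;
    let mut data = 0;
    let mut index = 4; // index count header
    for block in blocks {
        data += 4; // block count header
        for &(klen, vlen) in block.iter() {
            data += 9 + klen + vlen; // klen u32 + vlen u32 + type u8 + payload
        }
        // The index entry copies the block's first key.
        let first_key_len = block.first().map_or(0, |&(klen, _)| klen);
        index += 4 + 8 + 8 + first_key_len;
    }
    data + index + FOOTER
}

fn main() {
    println!("{}", file_bytes(&[])); // empty table: 36
    let one_entry: &[(usize, usize)] = &[(1, 1)];
    println!("{}", file_bytes(&[one_entry])); // one 1-byte key and value: 72
}
```

Comparing this number with the size reported by sstable size is a quick way to spot a writer that emits stray padding or a mis-sized header.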

Hex walkthrough of a 3-entry SSTable

Three small entries that fit in a single data block, e.g. put a 1, put bb 22, del ccc:

00000000  03 00 00 00                                      # count=3
00000004  01 00 00 00 01 00 00 00 00 'a' '1'               # entry 1: klen=1 vlen=1 type=0 "a" "1"
0000000f  02 00 00 00 02 00 00 00 00 'b' 'b' '2' '2'       # entry 2: klen=2 vlen=2 type=0 "bb" "22"
0000001c  03 00 00 00 00 00 00 00 01 'c' 'c' 'c'           # entry 3: klen=3 vlen=0 type=1 "ccc"

00000028  01 00 00 00                                      # index count=1
0000002c  01 00 00 00 00 00 00 00 00 00 00 00 28 00 00 00 00 00 00 00 'a'   # klen=1 offset=0 size=0x28 "a"

00000041  28 00 00 00 00 00 00 00           # footer.index_offset = 0x28
00000049  19 00 00 00 00 00 00 00           # footer.index_size   = 0x19
00000051  01 00 00 00 00 00 00 00           # footer.num_blocks   = 1
00000059  53 53 54 31 00 00 00 00           # magic "SST1\0\0\0\0"
                                            #   (file size = 0x61 = 97 bytes)

Note that the first key of the single block is "a", so the index entry copies that key.

What broken looks like

Symptom                                                          Likely cause
BadMagic at open                                                 file truncated, or footer overwritten by an interrupted writer.
BadBlock reading a block                                         block size in the index disagrees with the in-file count header, e.g. wrong endianness.
Two languages produce different file sizes for identical input   block-flush rule mismatch (> vs >=).
Unsorted from the writer                                         caller didn't iterate the MemTable in sorted order before add.
IndexOutOfRange at read                                          corrupted offset/size in the index; checked against file_len - 32 to fail loudly.

Verification — SSTable Format

V1: empty build

build from an empty MemTable produces a 36-byte file ending in SST1\0\0\0\0 with index_offset=0, index_size=4, num_blocks=0.

V2: single-entry build

add("k", Value("v")) → finish yields a file that:

  • contains exactly one data block,
  • has one index entry with key "k",
  • round-trips: iter returns [("k", Value("v"))], get("k") returns the value, get("missing") returns None.

V3: tombstones survive

A tombstone added during build is reported as Some(Entry::Tombstone) by get and as T <hex-key> by iter.

V4: ascending-key precondition

Calling add with a key that is not strictly greater than the previous added key MUST return Unsorted and leave no partial output.

V5: block-boundary rule

With target_block_size = 64 and inputs whose encoded sizes are known, the writer flushes the running block as soon as adding the next entry would push its size strictly greater than 64 bytes. A test inserts entries crafted so that the second insert is the boundary-crossing one and asserts the resulting file has exactly two data blocks and two index entries.

V6: footer invariants

For any file produced by finish:

  • the last 32 bytes parse as a Footer,
  • magic == "SST1\0\0\0\0",
  • index_offset + index_size + 32 == file_size,
  • num_blocks matches the count of index entries.
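The footer invariants listed above translate directly into a check over a complete file image. The sketch below is one way to express them; the helper names are hypothetical, and it returns a bool where the lab's reader would report typed errors.

```rust
const FOOTER_LEN: usize = 32;
const MAGIC: &[u8; 8] = b"SST1\0\0\0\0";

/// Read a little-endian u64 from the first 8 bytes of `b`.
fn u64_le(b: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(&b[..8]);
    u64::from_le_bytes(a)
}

/// Check the footer invariants on a whole SSTable image (hypothetical helper).
fn footer_invariants_hold(file: &[u8]) -> bool {
    if file.len() < FOOTER_LEN {
        return false;
    }
    let tail = &file[file.len() - FOOTER_LEN..];
    if tail[24..32] != MAGIC[..] {
        return false;
    }
    let (index_offset, index_size, num_blocks) =
        (u64_le(&tail[0..]), u64_le(&tail[8..]), u64_le(&tail[16..]));
    // index_offset + index_size + 32 == file_size, computed without overflow.
    let expected_len = index_offset
        .checked_add(index_size)
        .and_then(|n| n.checked_add(FOOTER_LEN as u64));
    if expected_len != Some(file.len() as u64) || index_size < 4 {
        return false;
    }
    // The first 4 bytes of the index block are its entry count.
    let p = index_offset as usize;
    let count = u32::from_le_bytes([file[p], file[p + 1], file[p + 2], file[p + 3]]);
    u64::from(count) == num_blocks
}

fn main() {
    // The 36-byte empty SSTable: empty index block followed by the footer.
    let mut file = vec![0u8; 4];
    file.extend_from_slice(&0u64.to_le_bytes()); // index_offset
    file.extend_from_slice(&4u64.to_le_bytes()); // index_size
    file.extend_from_slice(&0u64.to_le_bytes()); // num_blocks
    file.extend_from_slice(MAGIC);
    println!("{}", footer_invariants_hold(&file)); // true
}
```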

V7: bulk round-trip vs MemTable

Take a MemTable populated by bulk 100 + put key50 REPLACED + del key10 + put "" empty-key-value + del key99, build an SSTable from it, then verify that iter-over-SSTable returns the exact same (key, entry) sequence as iter-over-MemTable. Per-key get agrees too.

V8: floor lookup correctness

For a multi-block SSTable, get(target) returns the matching entry when present and None when the target falls between blocks, even though the index entry it lands on belongs to the preceding block.

V9: reader rejects bad magic

A file with the last 8 bytes mutated away from SST1\0\0\0\0 MUST return BadMagic on open.

V10: reader rejects out-of-range index pointer

A file whose footer claims index_offset >= file_size - 32 MUST return IndexOutOfRange on open (caught before any block read).

Broader Ideas — SSTable Format

  • Restart points and prefix compression. LevelDB stores keys as (shared_prefix_len, unshared_suffix, value) and resets the prefix every N entries (a "restart point"). The block trailer lists the restart offsets so binary search inside a block is still O(log N_restarts). This can shrink on-disk size substantially for sorted key sets with long shared prefixes, but it couples decode to encode order.
  • Two-level / partitioned index. A 1 TB SSTable would have ~10 GB of index entries. Partitioning the index into "index blocks of index blocks" keeps the resident index small at the cost of one extra pread per miss. RocksDB offers this as an opt-in partitioned (two-level) index for large SSTables.
  • Per-block bloom filters. Attaching a small Bloom (db-04) to each data block lets the reader skip the entire block on a miss without decoding it. Trades index/Bloom RAM for fewer block reads.
  • Block CRCs / per-block compression. Real engines write [data][type byte: compression][crc32c] per block; the writer computes the CRC over compressed bytes. Detects bit-rot at read time but adds CPU cost per block.
  • Streaming writer to disk. This lab returns Vec<u8>; production writers stream blocks into an os.File and only buffer the index in RAM. With a 4 KB block target and 250 entries/block, peak RAM is ~4 KB + the growing index.
  • Min/max keys per block in the index. Index entries can carry the last key too, so a query strictly between two blocks short-circuits without reading either. Costs ~2× index size.
  • Splitting hot blocks. Some engines (e.g. CockroachDB) measure read frequency per block and adaptively shrink hot blocks to reduce read amplification on small lookups.
  • Versioned magic. A future format change (e.g. adding bloom blocks) bumps the magic to SST2; readers can keep both code paths and choose at open time. Cheap, common practice.
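To make the restart-point idea from the list above concrete, here is a small sketch of the key-splitting step only. It is not LevelDB's encoder (no varints, no block trailer); the function names and the sample keys are illustrative assumptions.

```rust
/// Length of the common prefix of two keys (hypothetical helper).
fn shared_prefix_len(prev: &[u8], key: &[u8]) -> usize {
    prev.iter().zip(key).take_while(|(a, b)| a == b).count()
}

/// Split sorted keys into (shared_len, unshared_suffix) pairs, resetting the
/// shared length to 0 at every restart point (every `restart_interval` keys).
fn prefix_compress<'a>(keys: &[&'a [u8]], restart_interval: usize) -> Vec<(usize, &'a [u8])> {
    assert!(restart_interval > 0, "restart_interval must be at least 1");
    let mut out = Vec::with_capacity(keys.len());
    for (i, &key) in keys.iter().enumerate() {
        let shared = if i % restart_interval == 0 {
            0 // restart point: store the full key
        } else {
            shared_prefix_len(keys[i - 1], key)
        };
        out.push((shared, &key[shared..]));
    }
    out
}

fn main() {
    let keys: [&[u8]; 3] = [b"user:1001", b"user:1002", b"user:2000"];
    for (shared, suffix) in prefix_compress(&keys, 16) {
        println!("shared={} suffix={}", shared, String::from_utf8_lossy(suffix));
    }
}
```

Decoding a key requires the previous key in the same restart run, which is the coupling between decode and encode order mentioned above.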

Step 01 — Data Block Writer

Goal

Implement the smallest unit of an SSTable: a data block builder that accumulates entries and emits the on-disk block bytes.

Block format (recap)

[count: u32 LE]
repeat count times (keys ascending within the block):
  [klen: u32 LE][vlen: u32 LE][type: u8][key][value]

The block does not carry its own size — the index entry that points to it does.

Encoded entry size

entry_size(klen, vlen) = 9 + klen + vlen   # 4 + 4 + 1 + key + value

A block that holds N entries occupies 4 + Σ entry_size.

Flush rule

Track current_size = 4 (the block header) and the buffer separately. For each candidate entry (k, v):

sz = entry_size(len(k), len(v))
if buffer_non_empty AND current_size + sz > BLOCK_TARGET:
    flush()                # emit block, capture index entry, reset
push entry
current_size += sz

This rule lets a block grow up to and including BLOCK_TARGET bytes. The only block that can exceed the target is one holding a single oversized entry: because the rule never flushes an empty buffer, such an entry is emitted alone in its own block.

Side-by-side: Rust / Go / C++

Rust

const BLOCK_TARGET: usize = 4096;
const HEADER: usize = 4; // u32 entry count at the start of every block

/// Encoded size of one entry: klen u32 + vlen u32 + type u8 + key + value.
fn entry_size(k: usize, v: usize) -> usize { 9 + k + v }

struct BlockBuilder {
    buf: Vec<u8>,
    count: u32,
    first_key: Option<Vec<u8>>,
}

impl BlockBuilder {
    fn new() -> Self {
        let mut buf = Vec::with_capacity(BLOCK_TARGET);
        buf.extend_from_slice(&0u32.to_le_bytes()); // placeholder for count
        Self { buf, count: 0, first_key: None }
    }
    /// Bytes the block would occupy on disk right now, header included.
    fn current_size(&self) -> usize { self.buf.len() }
    fn add(&mut self, key: &[u8], ty: u8, value: &[u8]) {
        if self.count == 0 { self.first_key = Some(key.to_vec()); }
        self.buf.extend_from_slice(&(key.len() as u32).to_le_bytes());
        self.buf.extend_from_slice(&(value.len() as u32).to_le_bytes());
        self.buf.push(ty);
        self.buf.extend_from_slice(key);
        self.buf.extend_from_slice(value);
        self.count += 1;
    }
    /// Patch the count header and return (block bytes, first key).
    fn finish(mut self) -> (Vec<u8>, Vec<u8>) {
        self.buf[0..HEADER].copy_from_slice(&self.count.to_le_bytes());
        (self.buf, self.first_key.unwrap_or_default())
    }
}

Go

type blockBuilder struct {
    buf      []byte
    count    uint32
    firstKey []byte
}

func newBlock() *blockBuilder {
    b := &blockBuilder{buf: make([]byte, 0, blockTarget)}
    b.buf = binary.LittleEndian.AppendUint32(b.buf, 0) // placeholder
    return b
}
func (b *blockBuilder) currentSize() int { return len(b.buf) }
func (b *blockBuilder) add(key []byte, ty byte, value []byte) {
    if b.count == 0 { b.firstKey = append([]byte(nil), key...) }
    b.buf = binary.LittleEndian.AppendUint32(b.buf, uint32(len(key)))
    b.buf = binary.LittleEndian.AppendUint32(b.buf, uint32(len(value)))
    b.buf = append(b.buf, ty)
    b.buf = append(b.buf, key...)
    b.buf = append(b.buf, value...)
    b.count++
}
func (b *blockBuilder) finish() (block, firstKey []byte) {
    binary.LittleEndian.PutUint32(b.buf[0:4], b.count)
    return b.buf, b.firstKey
}

C++

struct BlockBuilder {
    std::vector<uint8_t> buf;
    uint32_t count = 0;
    std::vector<uint8_t> first_key;

    BlockBuilder() {
        buf.reserve(kBlockTarget);
        put_u32_le(buf, 0);                // placeholder
    }
    size_t current_size() const { return buf.size(); }
    void add(const uint8_t* k, size_t klen,
             uint8_t ty,
             const uint8_t* v, size_t vlen) {
        if (count == 0) first_key.assign(k, k + klen);
        put_u32_le(buf, uint32_t(klen));
        put_u32_le(buf, uint32_t(vlen));
        buf.push_back(ty);
        buf.insert(buf.end(), k, k + klen);
        buf.insert(buf.end(), v, v + vlen);
        ++count;
    }
    std::pair<std::vector<uint8_t>, std::vector<uint8_t>> finish() {
        // Patch the count header byte by byte so the result is little-endian on any host.
        for (int i = 0; i < 4; ++i) buf[i] = uint8_t(count >> (8 * i));
        return {std::move(buf), std::move(first_key)};
    }
};

(The count header is patched byte by byte rather than with memcpy: a memcpy of count would only produce little-endian bytes on a little-endian host.)

Self-check

  • Empty finish returns (b"\x00\x00\x00\x00", b"").
  • After three adds the buffer length equals 4 + Σ entry_size.
  • first_key is captured on the first add and never overwritten.

Step 02 — Writer, Index, Footer

Goal

Wire the data-block builder into a full SstWriter that emits [blocks...][index][footer].

Writer state

SstWriter {
    out: Vec<u8>,            // file bytes accumulated so far
    block: BlockBuilder,     // current data block
    index: Vec<IndexEntry>,  // one entry per flushed block
    target: usize,           // BLOCK_TARGET (default 4096)
    last_key: Option<Vec<u8>>,
}

add

fn add(&mut self, key: &[u8], ty: u8, value: &[u8]) -> Result<(), Error> {
    if let Some(prev) = &self.last_key {
        if key <= prev.as_slice() { return Err(Error::Unsorted); }
    }
    if ty == 1 && !value.is_empty() { return Err(Error::BadTombstone); }
    let sz = entry_size(key.len(), value.len());
    if self.block.count > 0 && self.block.current_size() + sz > self.target {
        self.flush_block();
    }
    self.block.add(key, ty, value);
    self.last_key = Some(key.to_vec());
    Ok(())
}

flush_block

fn flush_block(&mut self) {
    let mut blk = std::mem::replace(&mut self.block, BlockBuilder::new());
    let (bytes, first_key) = blk.finish();
    let offset = self.out.len() as u64;
    let size   = bytes.len() as u64;
    self.out.extend_from_slice(&bytes);
    self.index.push(IndexEntry { key: first_key, offset, size });
}

finish

fn finish(mut self) -> Vec<u8> {
    if self.block.count > 0 { self.flush_block(); }
    let index_offset = self.out.len() as u64;
    self.out.extend_from_slice(&(self.index.len() as u32).to_le_bytes());
    for e in &self.index {
        self.out.extend_from_slice(&(e.key.len() as u32).to_le_bytes());
        self.out.extend_from_slice(&e.offset.to_le_bytes());
        self.out.extend_from_slice(&e.size.to_le_bytes());
        self.out.extend_from_slice(&e.key);
    }
    let index_size = self.out.len() as u64 - index_offset;
    let num_blocks = self.index.len() as u64;
    self.out.extend_from_slice(&index_offset.to_le_bytes());
    self.out.extend_from_slice(&index_size.to_le_bytes());
    self.out.extend_from_slice(&num_blocks.to_le_bytes());
    self.out.extend_from_slice(b"SST1\0\0\0\0");
    debug_assert_eq!(self.out.len() as u64,
                     index_offset + index_size + FOOTER_LEN as u64);
    self.out
}

Footer parse

fn parse_footer(file: &[u8]) -> Result<Footer, Error> {
    if file.len() < FOOTER_LEN { return Err(Error::Short); }
    let tail = &file[file.len() - FOOTER_LEN..];
    if &tail[24..32] != b"SST1\0\0\0\0" { return Err(Error::BadMagic); }
    Ok(Footer {
        index_offset: u64::from_le_bytes(tail[ 0.. 8].try_into().unwrap()),
        index_size:   u64::from_le_bytes(tail[ 8..16].try_into().unwrap()),
        num_blocks:   u64::from_le_bytes(tail[16..24].try_into().unwrap()),
    })
}

The reader then verifies footer.index_offset + footer.index_size + 32 == file.len() (returns IndexOutOfRange otherwise) and parses the index block.

Index block parse

Identical structure to write:

let mut p = footer.index_offset as usize;
let count = read_u32_le(&file[p..]); p += 4;
let mut idx = Vec::with_capacity(count as usize);
for _ in 0..count {
    let klen = read_u32_le(&file[p..]) as usize;            p += 4;
    let off  = read_u64_le(&file[p..]);                     p += 8;
    let sz   = read_u64_le(&file[p..]);                     p += 8;
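    // A klen that runs past the end of the file is reported instead of panicking.
    let key  = file.get(p..p + klen).ok_or(Error::IndexOutOfRange)?.to_vec(); p += klen;
    // checked_add keeps a corrupt offset/size pair from wrapping around u64.
    if off.checked_add(sz).map_or(true, |end| end > footer.index_offset) { // beyond data region
        return Err(Error::IndexOutOfRange);
    }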
    idx.push(IndexEntry { key, offset: off, size: sz });
}

Point lookup (get)

fn get(&self, key: &[u8]) -> Option<Entry> {
    // Floor = rightmost index entry whose first_key <= key.
    let pos = match self.index.binary_search_by(|e| e.key.as_slice().cmp(key)) {
        Ok(i)  => i,                  // exact first-key match
        Err(0) => return None,        // key precedes the smallest block
        Err(i) => i - 1,
    };
    let blk = &self.index[pos];
    let block_bytes = &self.file[blk.offset as usize
                                .. (blk.offset + blk.size) as usize];
    scan_block(block_bytes, key)
}

scan_block decodes entries in order and returns the matching one when found, None otherwise (a hit on a tombstone returns Some(Entry::Tombstone) — the engine layer decides what tombstones mean).
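scan_block is described above but not shown. The following is a minimal sketch of one possible shape, assuming an Entry enum with Value and Tombstone variants (matching how the verification notes refer to entries) and flattening malformed-block errors to None, where the lab's reader would return BadBlock. The push_entry helper exists only to build the demo block.

```rust
#[derive(Debug, PartialEq)]
enum Entry {
    Value(Vec<u8>),
    Tombstone,
}

/// Read a little-endian u32 at `at`, or None if the buffer is too short.
fn read_u32(buf: &[u8], at: usize) -> Option<u32> {
    let b = buf.get(at..at + 4)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Linear scan of one data block:
/// [count u32] then count x [klen u32][vlen u32][type u8][key][value].
fn scan_block(block: &[u8], key: &[u8]) -> Option<Entry> {
    let count = read_u32(block, 0)?;
    let mut p = 4;
    for _ in 0..count {
        let klen = read_u32(block, p)? as usize;
        let vlen = read_u32(block, p + 4)? as usize;
        let ty = *block.get(p + 8)?;
        let k = block.get(p + 9..p + 9 + klen)?;
        let v = block.get(p + 9 + klen..p + 9 + klen + vlen)?;
        if k == key {
            return Some(if ty == 1 { Entry::Tombstone } else { Entry::Value(v.to_vec()) });
        }
        if k > key {
            return None; // keys ascend, so the target cannot appear later
        }
        p += 9 + klen + vlen;
    }
    None
}

/// Demo-only encoder for one entry.
fn push_entry(block: &mut Vec<u8>, key: &[u8], ty: u8, value: &[u8]) {
    block.extend_from_slice(&(key.len() as u32).to_le_bytes());
    block.extend_from_slice(&(value.len() as u32).to_le_bytes());
    block.push(ty);
    block.extend_from_slice(key);
    block.extend_from_slice(value);
}

fn main() {
    let mut block = 3u32.to_le_bytes().to_vec();
    push_entry(&mut block, b"a", 0, b"1");
    push_entry(&mut block, b"bb", 0, b"22");
    push_entry(&mut block, b"ccc", 1, b"");
    println!("{:?}", scan_block(&block, b"bb"));  // Some(Value([50, 50]))
    println!("{:?}", scan_block(&block, b"ccc")); // Some(Tombstone)
    println!("{:?}", scan_block(&block, b"b"));   // None
}
```

The early exit on k > key is an optimisation that relies on the ascending-key precondition enforced by the writer.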

Self-check

  • An empty writer: finish() length is exactly 36 (4-byte empty index + 32-byte footer).
  • After one add, file length = 4 + 9 + |k| + |v| (block) + 4 + 4 + 8 + 8 + |k| (index) + 32 (footer).
  • For a target of 64 and entries crafted with sizes 30, 30, 30: the first add fits (4 + 30 = 34 ≤ 64); the second also fits, because 34 + 30 = 64 equals the target and the rule flushes only when the sum is strictly greater; the third (64 + 30 = 94 > 64) triggers a flush. Result: two data blocks.

Step 03 — CLI and Cross-Language Test

CLI surface

sstable build  IN.mt OUT.sst        # MemTable dump in → SSTable out
sstable footer FILE.sst              # prints footer values + magic_ok
sstable get    FILE.sst KEY          # value: <hex> | tombstone | absent
sstable iter   FILE.sst              # V <hex-key> <hex-value> | T <hex-key>
sstable size   FILE.sst              # file_bytes=B entries=N num_blocks=K

The hex-encoding and value: / tombstone / absent strings match db-05 so the cross-test reuses the same comparison logic.

Cross-test scenario

Identical input across all three languages:

memtable new        M.mt
memtable bulk       M.mt 100
memtable put        M.mt key50 REPLACED
memtable del        M.mt key10
memtable put        M.mt ""     empty-key-value
memtable del        M.mt key99
sstable  build      M.mt OUT.sst

Cross-test checks:

  1. Byte identity. sha256 of OUT.sst matches across rust / go / c++. (Same input MemTable dump + same writer rules ⇒ same bytes.)
  2. 3×3 iter matrix. Every reader can iterate every writer's output, producing identical line-by-line dumps.
  3. 3×3 footer parse. sstable footer OUT.sst from every reader on every writer's output reports the same index_offset / index_size / num_blocks and magic_ok=true.
  4. Spot-check get. For each language: get key50 → value: 5245504c41434544, get key10 → tombstone, get "" → value: 656d7074792d6b65792d76616c7565, get nope → absent.
  5. Iter equivalence vs MemTable. sstable iter OUT.sst matches memtable iter M.mt byte-for-byte (the SSTable preserves the sorted entry stream, including tombstones).

Block-boundary check

With the scenario's small entries (key0..key99 / val0..val99, 17 to 19 encoded bytes each once the 9-byte entry header is included), the whole table encodes to roughly 1.9 KB, so a 4096-byte block target yields a single data block. The cross-test asserts only that num_blocks ≥ 1 and that every reader agrees on the count.

A separate sub-test forces a small block target (64 bytes) on identical input across the three languages and asserts the resulting num_blocks value matches; this is the precise boundary-rule check.

Output formats (exact strings)

Command         Format
footer          index_offset=<N> index_size=<N> num_blocks=<N> magic_ok=<true|false>
get             value: <hex> | tombstone | absent
iter value      V <hex-key> <hex-value>
iter tombstone  T <hex-key>
size            file_bytes=<N> entries=<N> num_blocks=<N>

The cross-test scripts diff these as plain text.

db-07: LSM Compaction

0. Why compaction at all?

The LSM write path (db-05 MemTable + db-06 SSTable) is intentionally append-only. When a key is updated, the new version is written to a fresh MemTable and later flushed to a fresh SSTable; the old version is still sitting in some older SSTable. When a key is deleted, a tombstone is written, not a removal.

Without compaction, three pathologies grow without bound:

| Pathology | Symptom | Bound |
| --- | --- | --- |
| Read amplification | A single get() must check every live SSTable. | O(#SSTables) |
| Space amplification | Obsolete versions and tombstones keep occupying disk. | Total writes / live bytes |
| Index/metadata bloat | Reader has to load every SSTable's index. | O(#SSTables) |

Compaction merges N input SSTables into M output SSTables, applying newest-wins semantics and (eventually) purging tombstones. It trades extra write I/O (write amplification) for bounded read and space amplification.

1. The two strategies (one sentence each)

  • Leveled (LevelDB, RocksDB default): level L holds at most ~10× the bytes of level L-1. When a level is full, you pick one file and compact it against the overlapping files in L+1. Read amp ≈ #levels; space amp ≈ 1.1×.
  • Tiered (Cassandra default, Pebble's "level 0"): when level L has K files, merge all of them into a single L+1 file. Read amp ≈ #levels × K; space amp can be 2–3×; write amp is much lower.
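To make the 10× rule concrete, the sketch below counts how many levels a leveled tree needs before its cumulative capacity covers a given data size. The helper name levels_needed and the 10 MiB L1 budget used in the note afterwards are illustrative assumptions, not values this lab defines.

```rust
/// Number of levels (L1, L2, ...) needed before the cumulative capacity
/// covers `total_bytes`, when L1 holds `l1_bytes` and every level is
/// `fanout` times larger than the one above it.
/// Hypothetical helper for illustration; not part of the lab's API.
fn levels_needed(total_bytes: u64, l1_bytes: u64, fanout: u64) -> u32 {
    assert!(l1_bytes > 0 && fanout > 1);
    let mut levels = 0;
    let mut capacity = 0u64; // cumulative capacity of the levels counted so far
    let mut level_bytes = l1_bytes; // capacity of the next level to add
    while capacity < total_bytes {
        levels += 1;
        capacity = capacity.saturating_add(level_bytes);
        level_bytes = level_bytes.saturating_mul(fanout);
    }
    levels
}
```

Under those assumed numbers (L1 = 10 MiB, fanout 10), 1 GiB of data needs 3 levels and 100 GiB needs 5, which is why read amplification of roughly #levels stays small for a leveled tree.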

This lab implements neither policy. It implements the mechanism they both need: a correct K-way merge that respects recency ordering and tombstones. Picking the policy is a separate problem (and a configurable one).

2. The mechanism: K-way merge

Inputs: an ordered list of SSTables [A, B, C, ...], where A is the newest (most recently written) and the rest follow in age order.

Output: a single SSTable whose entries are the sorted union of all input keys, where for any key k the entry is taken from the first input that contains it. Tombstones are entries — they win against older values just like a put.

The merge is a textbook K-way merge:

  1. Open all inputs and produce per-input cursors that iterate in key order.
  2. Push each cursor's current key into a min-heap keyed by (key, source_index). source_index is the recency tiebreaker — smaller index = newer.
  3. Pop the smallest. This is the next unique key and its winning entry.
  4. Emit it (subject to the tombstone-drop rule below).
  5. Advance the popped cursor. Also advance every other cursor whose current key equals the just-emitted key (they are stale duplicates).
  6. Repeat until the heap is empty.

With a min-heap over K cursors and N total entries, the merge costs O(N log K): each entry is pushed and popped at most once, and each heap operation is O(log K).
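The six steps above can be sketched in Rust as follows. This is a minimal in-memory version under stated assumptions: the Entry enum stands in for the db-06 entry type, each input is an already-materialized vector sorted by key with unique keys, and the function returns merged entries rather than SSTable bytes. The names merge_newest_wins and advance are hypothetical.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Stand-in for the db-06 entry type (assumed shape, for illustration only).
#[derive(Clone, Debug, PartialEq)]
enum Entry {
    Value(Vec<u8>),
    Tombstone,
}

/// Merge key-sorted inputs, newest first (index 0 = newest).
fn merge_newest_wins(
    inputs: &[Vec<(Vec<u8>, Entry)>],
    drop_tombstones: bool,
) -> Vec<(Vec<u8>, Entry)> {
    let mut pos = vec![0usize; inputs.len()]; // one cursor position per input
    let mut heap = BinaryHeap::new(); // max-heap; Reverse makes it a min-heap
    for (src, input) in inputs.iter().enumerate() {
        if let Some((key, _)) = input.first() {
            heap.push(Reverse((key.clone(), src)));
        }
    }

    let mut out = Vec::new();
    while let Some(Reverse((key, src))) = heap.pop() {
        // Smallest key; among equal keys the smallest src (newest) pops first.
        let entry = inputs[src][pos[src]].1.clone();
        advance(inputs, &mut pos, &mut heap, src);

        // Drain stale duplicates of the same key from older inputs.
        while heap.peek().map_or(false, |Reverse((k, _))| *k == key) {
            let Reverse((_, older)) = heap.pop().unwrap();
            advance(inputs, &mut pos, &mut heap, older);
        }

        if drop_tombstones && entry == Entry::Tombstone {
            continue; // purge the tombstone instead of emitting it
        }
        out.push((key, entry));
    }
    out
}

/// Move cursor `src` forward and re-push its next key, if any.
fn advance(
    inputs: &[Vec<(Vec<u8>, Entry)>],
    pos: &mut [usize],
    heap: &mut BinaryHeap<Reverse<(Vec<u8>, usize)>>,
    src: usize,
) {
    pos[src] += 1;
    if let Some((key, _)) = inputs[src].get(pos[src]) {
        heap.push(Reverse((key.clone(), src)));
    }
}
```

Fed the two inputs from the hand-trace in section 8 (A = a → V "1", b → T; B = a → V "0", c → V "9"), the sketch yields three entries with drop_tombstones=false and two with drop_tombstones=true.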

3. Newest-wins semantics

The contract:

  • Inputs are ordered by recency. Index 0 is newest.
  • For each distinct key, the first input that contains it wins.
  • The winning entry's type (Value or Tombstone) is preserved.
  • All other versions of that key are discarded.

This matches what a layered reader would do on a get() query if it walked the levels top-down and short-circuited on the first hit.

4. Tombstone purging

A tombstone exists to hide an older version of a key. It is safe to drop a tombstone if and only if there is no older version anywhere that the tombstone could be hiding.

Two cases:

  • Compacting the bottom level. There is nothing older. Every tombstone whose only remaining copy is in the output is safe to drop. Callers signal this with drop_tombstones=true.
  • Compacting a non-bottom level. Even if no input has an older version of the key, a deeper level still might. Tombstones must be kept. Callers leave drop_tombstones=false.

This lab exposes the flag and trusts the caller. A real engine wires it from the level metadata.
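As a rough sketch of what "wiring it from the level metadata" could look like, the code below shows two options. The first is the coarse per-compaction rule this lab's flag models. The second is a finer per-key check, similar in spirit to the IsBaseLevelForKey check this document attributes to LevelDB. The FileMeta record and both function names are hypothetical; a real manifest would carry more fields.

```rust
/// Key range covered by one SSTable file (hypothetical manifest record).
struct FileMeta {
    smallest: Vec<u8>,
    largest: Vec<u8>,
}

/// Coarse rule: drop tombstones only when the compaction writes into the
/// deepest level that currently holds any data.
fn drop_tombstones_for(output_level: usize, deepest_nonempty_level: usize) -> bool {
    output_level >= deepest_nonempty_level
}

/// Finer, per-key rule: a tombstone for `key` may be dropped if no file in
/// any level below the output level has a key range that contains `key`.
/// `deeper_levels[i]` lists the files of the i-th level below the output.
fn no_older_version_possible(key: &[u8], deeper_levels: &[Vec<FileMeta>]) -> bool {
    deeper_levels
        .iter()
        .flatten()
        .all(|f| key < f.smallest.as_slice() || key > f.largest.as_slice())
}
```

The per-key rule is conservative in the safe direction: a file whose range contains the key might not actually hold it, in which case the tombstone is kept longer than strictly necessary but never dropped too early.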

5. What this lab does NOT do (and why)

  • No splitting: the output is a single SSTable. Production engines cap output file size to keep per-file work bounded. The merge logic is the same; splitting is an output-side concern handled by switching SstWriter targets.
  • No level metadata: there is no notion of "this output belongs to level N". That belongs to a manifest / version-edit log, which is db-09 territory.
  • No deletion of obsolete inputs: the caller is responsible for unlinking the input files once the output is durable. We just return bytes.
  • No checksums or atomic rename: writing-then-renaming and checksumming blocks belong in db-08+.

6. Cross-language contract

The output is a db-06 SSTable. Two implementations that compact the same inputs in the same order with the same drop_tombstones flag must produce byte-identical outputs. We verify this with sha256 across rust/go/cpp.

7. Failure modes worth recognizing

| Bug | Symptom |
| --- | --- |
| Wrong recency tiebreaker (older wins on ties) | After compaction, a recently-overwritten key reverts. |
| Forgetting to advance non-winning duplicates | Same key appears multiple times in output → SstWriter errors. |
| Comparing keys as strings (UTF-8) not bytes | Non-ASCII keys order wrong; cross-lang sha256 diverges. |
| Dropping tombstones when not at bottom | Deleted keys reappear from a deeper level. |
| Emitting an empty block instead of empty SSTable | File size ≠ 36 for empty merge; reader rejects. |
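Whether "compare as strings" actually misorders keys depends on the language. Rust and Go compare valid UTF-8 strings byte by byte, so one way the bug shows up there is a lossy bytes-to-text conversion before comparing. The sketch below demonstrates that case; the function names are hypothetical, and the two keys in the note were picked so that the two orders disagree.

```rust
use std::cmp::Ordering;

/// Correct: compare keys as raw bytes.
fn cmp_bytes(a: &[u8], b: &[u8]) -> Ordering {
    a.cmp(b)
}

/// Buggy on purpose: decode keys as text first. Invalid UTF-8 is replaced
/// with U+FFFD, which can change the relative order of two keys.
fn cmp_lossy_text(a: &[u8], b: &[u8]) -> Ordering {
    String::from_utf8_lossy(a).cmp(&String::from_utf8_lossy(b))
}
```

For the keys [0xff] and [0xf0, 0x90, 0x80, 0x80], cmp_bytes reports Greater while cmp_lossy_text reports Less, because 0xff becomes U+FFFD (bytes ef bf bd) after lossy decoding. On ASCII-only keys such as key10 and key5 the two functions agree, which is why the canonical scenario cannot catch this bug.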

8. Hand-trace template (the smallest interesting example)

Inputs (newest first):

A: [("a",V,"1"), ("b",T)]
B: [("a",V,"0"), ("c",V,"9")]

Step-by-step heap state and emit:

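| step | heap (key,src) | pop | emit | notes |
| --- | --- | --- | --- | --- |
| 0 | (a,A) (a,B) | (a,A) | a → V "1" | also advance B past "a" |
| 1 | (b,A) (c,B) | (b,A) | b → T | tombstone preserved |
| 2 | (c,B) | (c,B) | c → V "9" | A is exhausted |
| 3 | (empty) | - | - | done |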

Output: [("a",V,"1"), ("b",T), ("c",V,"9")] — 3 distinct keys, A's versions of a and b win, c comes from B.

db-07 references

Foundational

  • O'Neil, P. et al. The Log-Structured Merge-Tree (LSM-Tree). Acta Informatica, 1996. The original. Read sections 3–4 for the merge/rolling-merge mechanism.
  • Chang, F. et al. Bigtable: A Distributed Storage System for Structured Data. OSDI 2006. Section 5.4 ("Compactions") frames minor, merging, and major compactions on top of SSTables.

Engineering, read these

Curriculum companions

Algorithm

  • K-way merge with a min-heap: any algorithms textbook. The pattern here is identical to "merge K sorted lists" with an extra rule for duplicate keys.

db-07 Analysis

Surface area

The lab exposes one library function and one CLI:

compact(inputs: ordered list of SSTable bytes, drop_tombstones: bool) -> SSTable bytes

inputs[0] is the newest. Empty input list returns an empty SSTable (36 bytes, identical to SstWriter::new().finish() from db-06).

CLI:

compact [--drop-tombstones] OUT.sst IN1.sst IN2.sst ...

State machine of the merge

The merger holds K cursors, one per input. Each cursor is a sequence of (key, entry) pairs in sorted key order, produced by iterating the input SSTable's blocks in order.

A min-heap holds at most K entries, each (current_key, source_index). source_index is the position in inputs (smaller = newer).

State transitions:

init:    push each non-empty cursor's first (key, src) into heap
step:    pop top (key=k, src=i)
         take entry from cursor i, advance cursor i
         for every other cursor j whose current key == k: advance cursor j
         re-push any cursor that still has items (only those that advanced past k)
         emit (k, entry) unless (entry is Tombstone AND drop_tombstones)
done:    when heap is empty

The "advance every cursor whose current key == k" rule is what makes the merge deduplicating. It is the only subtle bit. Forget it and SstWriter rejects the output with Unsorted because the same key reappears.

Containers per language

  • Rust: BinaryHeap<Reverse<(Vec<u8>, usize)>> — pop smallest by key, ties broken by source index (smaller = newer = wins). Cursors are IntoIter over pre-materialized Vec<(Vec<u8>, Entry)> from SstReader::entries().
  • Go: container/heap with a struct slice. Same ordering. Cursors are index counters into []Entry.
  • C++: std::priority_queue with custom comparator that flips to min-heap. Cursors are std::vector<...>::const_iterator pairs.

Materializing all entries up front is wasteful for huge SSTables but is fine for this lab and keeps the three implementations symmetric. A streaming reader is the next step (db-08 block-cache and iterators).

What's intentionally not optimized

  • We materialize entries instead of streaming blocks. This avoids needing a block-by-block iterator API on db-06's SstReader, which would couple the two labs more tightly than the curriculum wants at this stage.
  • We use a single output SSTable. Output splitting is one if-statement in the emit step (flush + start new SstWriter when size exceeds N). Doing it here would force a "list of outputs" API that doesn't matter for byte-identity.
  • We do not parallelize. A single K-way merge is inherently serial; partitioning the key space into independent merges is a policy concern that belongs above this layer.

What could break the cross-language byte-identity

  1. Tiebreaker inconsistency between heap implementations. Pin it: for two equal keys, the cursor with the smaller source index wins. All three implementations must agree on this exactly.
  2. Comparing keys as language-native strings (UTF-8 ordering). All three must compare as byte slices (Vec<u8> / []byte / std::vector<uint8_t>).
  3. Forgetting to advance non-winning duplicates. Output will contain repeats; SstWriter from db-06 will reject with Unsorted. Good — fail loud.
  4. Different block-target sizes. We always use the db-06 default (4096) so the output is a single block for the canonical scenario.

Verification plan in one line

Build two distinct memtables (newer + older), promote each to an SSTable using db-06, run compact [newer, older] in all three languages, then assert sha256 equality and spot-check that newest-wins applied correctly.

db-07 Execution

Library API (per language, same shape)

fn compact(inputs: &[SstReader], drop_tombstones: bool) -> Vec<u8>
  • inputs[0] is newest.
  • Returns the bytes of a db-06 SSTable.
  • Empty inputs → 36-byte empty SSTable.

CLI

compact [--drop-tombstones] OUT.sst IN1.sst IN2.sst ...
  • IN1 is newest.
  • Output OUT.sst is byte-identical across rust/go/cpp for the same arguments.

Algorithm (pseudocode)

function compact(inputs, drop_tombstones):
  cursors = [iter(input) for input in inputs]   # each iter yields (key, entry) in key order
  heap = empty min-heap                          # entries: (key, src)
  for i, c in enumerate(cursors):
    k = c.peek()
    if k is not None: heap.push((k, i))

  out = SstWriter()
  while heap not empty:
    (k, i_win) = heap.pop()
    entry = cursors[i_win].next()                # consume winner
    if cursors[i_win].peek() is not None:
      heap.push((cursors[i_win].peek(), i_win))

    # Drain all older duplicates of the same key
    while heap not empty and heap.peek().key == k:
      (_, j) = heap.pop()
      cursors[j].next()
      if cursors[j].peek() is not None:
        heap.push((cursors[j].peek(), j))

    if entry.is_tombstone and drop_tombstones:
      continue
    out.add(k, entry)

  return out.finish()

Heap ordering: lexicographic on key; tiebreak by source index ascending (smaller index = newer = wins on equal keys).

How to wire it (per language)

| Lang | Cursor | Heap |
| --- | --- | --- |
| Rust | std::vec::IntoIter<(Vec<u8>, Entry)> | BinaryHeap<Reverse<(Vec<u8>, usize)>> |
| Go | index into []struct{Key,Entry} | container/heap with Less honoring (key,src) |
| C++ | pair of vector<...>::const_iterator | priority_queue with greater-than comparator |

In all three, "peek a cursor's current key" is an in-memory lookup such as cursors[i].keys[idx_i] (or equivalent). There is no I/O during peek, because entries are materialized once.

db-07 Observation

The canonical scenario

We build two SSTables (call them newer.sst and older.sst) and compact them in the order [newer, older].

newer.sst — produced from this MemTable scenario

memtable new
memtable bulk 50            # key0..key49 -> val0..val49
memtable put  "key10" "NEW-10"
memtable del  "key5"

So newer.sst contains 50 distinct keys, of which key10 has value "NEW-10", key5 is a tombstone, and the other 48 are val<i>.

older.sst — produced from this MemTable scenario

memtable new
memtable bulk 100           # key0..key99 -> val0..val99
memtable put  "key50" "OLD-50"

So older.sst contains 100 distinct keys, of which key50 is "OLD-50" and the others are val<i>.

Expected merged output

For every key, the merge picks the first input that contains it:

| Key range / specific key | Winner | Value |
| --- | --- | --- |
| key0..key4 | newer | val0..val4 |
| key5 | newer | Tombstone |
| key6..key9 | newer | val6..val9 |
| key10 | newer | "NEW-10" |
| key11..key49 | newer | val11..val49 |
| key50 | older | "OLD-50" |
| key51..key99 | older | val51..val99 |

Total distinct keys: 100. Tombstones: 1 (key5). Values: 99.
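One way to cross-check the table above without trusting the heap merge is a reference model: replay the inputs oldest-first into an ordered map so that newer entries overwrite older ones. The sketch below does this for the canonical scenario. The Table alias, the None-as-tombstone encoding, and the helper names are assumptions made for illustration.

```rust
use std::collections::BTreeMap;

/// None = tombstone, Some(v) = value. Keys and values are raw bytes, so the
/// map iterates in the same byte order an SSTable uses.
type Table = BTreeMap<Vec<u8>, Option<Vec<u8>>>;

/// key0..key{n-1} -> val0..val{n-1}, mirroring `memtable bulk n`.
fn bulk(n: usize) -> Table {
    (0..n)
        .map(|i| (format!("key{}", i).into_bytes(), Some(format!("val{}", i).into_bytes())))
        .collect()
}

/// Reference model: apply inputs oldest-first so newer entries overwrite.
fn oracle(inputs_newest_first: &[Table]) -> Table {
    let mut merged = Table::new();
    for table in inputs_newest_first.iter().rev() {
        for (k, v) in table {
            merged.insert(k.clone(), v.clone());
        }
    }
    merged
}

/// The newer/older pair from the canonical scenario.
fn canonical_inputs() -> (Table, Table) {
    let mut newer = bulk(50);
    newer.insert(b"key10".to_vec(), Some(b"NEW-10".to_vec()));
    newer.insert(b"key5".to_vec(), None); // memtable del key5
    let mut older = bulk(100);
    older.insert(b"key50".to_vec(), Some(b"OLD-50".to_vec()));
    (newer, older)
}
```

Calling oracle on [newer, older] gives 100 keys with exactly one tombstone, matching the totals above. Because the oracle shares no code with the heap-based merge, a disagreement between the two points at a merge bug rather than a test bug.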

"What broken looks like"

| Bug | Symptom |
| --- | --- |
| Tiebreaker swapped (older wins) | key10 → "val10" instead of "NEW-10"; key5 → "val5" instead of tombstone. |
| Forget to drain duplicates | SstWriter::add returns Unsorted error (or "keys not strictly ascending"). |
| Byte-vs-string comparison | Not visible on this ASCII-only input, where byte order and string order agree; output sha256 diverges across languages once keys contain non-ASCII bytes. |
| Tombstone dropped when drop_tombstones=false | Output has 99 keys instead of 100; key5 missing. |
| Tombstone kept when drop_tombstones=true at bottom | Output has 100 keys instead of 99; key5 still present as tombstone. |

With drop_tombstones=true

Same inputs, run as bottom-level compaction:

  • key5 disappears entirely (newer's only entry for key5 was a tombstone).
  • 99 keys total, all values.

Hex of the absolute simplest compaction

Compacting [A, B] where A = [("k", T)] and B = [("k", V, "v")]:

  • drop_tombstones=false: output is an SSTable with one entry ("k", T). File size = 4 (block hdr) + 4+4+1 (entry hdr) + 1 (key) + 0 (value) + 4 (index count) + 4+8+8+1 (one index entry) + 32 (footer) = 71 bytes. This is the same as sstable build of a MemTable containing only ("k", T).
  • drop_tombstones=true: output is the empty SSTable, exactly 36 bytes.

Cross-language sha256 must match for both cases.

db-07 Verification

Ten properties, three implementations each.

V1 — Empty inputs

compact([], drop=false) → exactly 36 bytes, identical to SstWriter::new().finish() from db-06. Same for drop=true.

V2 — Single input passthrough

compact([A], drop=false) reproduces A's logical contents (same entries in same order). The bytes are not necessarily identical to A (block boundaries may differ if A had unusual block-target settings), but the output's entries() matches A's entries() exactly.

V3 — Newest wins on overlap

Inputs A = [("k", V, "new")], B = [("k", V, "old")]. Output contains ("k", V, "new") only. Output entry count = 1.

V4 — Tombstones win over older values

Inputs A = [("k", T)], B = [("k", V, "v")]. With drop=false, output contains ("k", T). With drop=true, output is empty.

V5 — Disjoint keys interleave correctly

Inputs A = [("b", V, "x"), ("d", V, "x")], B = [("a", V, "y"), ("c", V, "y")]. Output: ("a", V, "y"), ("b", V, "x"), ("c", V, "y"), ("d", V, "x") — sorted, no duplicates, every entry from its sole source.

V6 — Three-way merge handles transitive dedupe

Inputs (newest → oldest):

A: [("k", V, "v1")]
B: [("k", V, "v2"), ("z", V, "Z")]
C: [("a", V, "A"), ("k", V, "v3")]

Output: [("a", V, "A"), ("k", V, "v1"), ("z", V, "Z")]. Key "k" resolves to A's version. Both B and C must advance past their "k" entries even though neither wins.

V7 — Canonical scenario byte-identity

Build newer.sst and older.sst as described in observation.md. Compact in each language with drop=false. Assert sha256 equality across all three languages.

V8 — SstWriter rejects an internally broken merge

If the merger forgets to drain duplicate cursors and tries to call SstWriter::add with the same key twice, the writer returns Error::Unsorted. The test for this constructs two inputs with overlapping keys and verifies that a correct compaction succeeds (i.e., we never see that error during a valid compaction).

V9 — Output is a valid db-06 SSTable

The bytes returned by compact open cleanly via SstReader::open and get(key) returns the merged version. This is the round-trip test.

V10 — Idempotent re-compaction

compact([compact([A, B])]) is byte-identical to compact([A, B]). Compaction of an already-compacted file is a no-op modulo metadata.

Cross-test (scripts/cross_test.sh)

Goes beyond V7 to also run a 3×3 reader/writer matrix on the merged file (byte-identity already implies this, but it confirms the output is portable):

  1. Build newer.sst and older.sst via db-05 → db-06 (Rust binaries; db-06 already proved byte-identity).
  2. Each language runs compact OUT.sst newer.sst older.sst.
  3. Assert sha256 match across the three OUT files.
  4. Each language reads each OUT file with sstable iter (from db-06) and asserts the iter output equals a reference (the Rust read of its own OUT).
  5. Spot-check sstable get OUT.sst key10 → value: 4e45572d3130 ("NEW-10") in all three.
  6. Spot-check sstable get OUT.sst key5 → tombstone.
  7. Spot-check sstable get OUT.sst key50 → value: 4f4c442d3530 ("OLD-50").
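The expected strings in the spot-checks can be reproduced with a few lines of Rust. This sketch assumes lowercase two-digit hex (which matches the hex strings quoted in this document) and uses a hypothetical Lookup enum to model the three get outcomes; it is not the db-06 CLI code.

```rust
/// Result of a point lookup, mirroring the three `get` outcomes.
enum Lookup<'a> {
    Value(&'a [u8]),
    Tombstone,
    Absent,
}

/// Lowercase hex, two digits per byte.
fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Render a lookup result in the exact-string format the scripts diff.
fn format_get(result: Lookup<'_>) -> String {
    match result {
        Lookup::Value(v) => format!("value: {}", to_hex(v)),
        Lookup::Tombstone => "tombstone".to_string(),
        Lookup::Absent => "absent".to_string(),
    }
}
```

With this, format_get(Lookup::Value(b"NEW-10")) produces value: 4e45572d3130 and format_get(Lookup::Value(b"OLD-50")) produces value: 4f4c442d3530, the two strings checked in steps 5 and 7.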

db-07 Broader Ideas

What you'd add next, in order of payoff

  1. Output splitting. Add compact_to_files(inputs, drop, target_bytes) -> Vec<Vec<u8>>. Implementation: switch SstWriter when the in-flight writer exceeds target_bytes. You must finalize at a key boundary (between two emitted entries), never inside a logical key, otherwise readers that depend on per-file key ranges will see overlaps.

  2. Streaming block iterator on SstReader. db-06's entries() materializes everything; the compaction loop should pull one entry at a time per cursor. This is db-08 territory (block cache + iterators).

  3. Range tombstones. A "delete all keys in [lo, hi)" record. Compaction has to track a set of active range tombstones during the merge and apply them to subsequent entries. Pebble's range-deletions doc is the reference.

  4. Snapshot-aware tombstone purging. "Drop tombstones at bottom" becomes "drop tombstones older than the oldest live snapshot". Compaction takes a sequence-number floor and drops anything below it that has been superseded.

  5. Leveled policy. A scheduler that picks N input files to compact based on per-level byte budgets and overlap. This is where VersionSet::PickCompaction and Compaction::IsBaseLevelForKey live in LevelDB.

  6. Subcompactions. Splitting one logical compaction into K parallel ones by key range. Requires that the index of each input lets you cheaply find the byte range covering a key span — partitioned index helps.

  7. Compaction throttling. When compaction can't keep up, foreground writes must stall. RocksDB exposes level0_slowdown_writes_trigger and level0_stop_writes_trigger. Without this, write bursts cause unbounded read amplification.

  8. Universal/tiered compaction. A different scheduler; same merge mechanism. Worth implementing once leveled is in to feel the trade-off.

  9. Per-key sequence numbers. Every key gets a monotonically-increasing seqnum; compaction picks the highest-seqnum entry for each key. This makes the merge correct under concurrent writes and snapshots. Required for MVCC (db-13).

  10. Compaction filter callbacks. RocksDB lets the user inspect/transform every key during compaction (garbage collection of TTL'd values, schema migration). It's just a hook in the emit step.
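Item 1 (output splitting) is small enough to sketch. The version below works on an already-merged, key-sorted stream and simplifies entries to plain key/value byte pairs; the function name split_by_size and the choice to cut right after the entry that crosses the target are assumptions, not the lab's API. Because a merged stream contains each key once, cutting between any two entries is always a cut at a key boundary.

```rust
/// Split a merged, key-sorted entry stream into chunks whose payload
/// (key bytes + value bytes) reaches roughly `target_bytes` each.
/// Each chunk would be handed to its own SstWriter.
fn split_by_size(
    merged: Vec<(Vec<u8>, Vec<u8>)>,
    target_bytes: usize,
) -> Vec<Vec<(Vec<u8>, Vec<u8>)>> {
    let mut outputs = Vec::new();
    let mut current = Vec::new();
    let mut current_bytes = 0usize;
    for (key, value) in merged {
        current_bytes += key.len() + value.len();
        current.push((key, value));
        // Cut only between entries, right after the one that crossed the target.
        if current_bytes >= target_bytes {
            outputs.push(std::mem::take(&mut current));
            current_bytes = 0;
        }
    }
    if !current.is_empty() {
        outputs.push(current); // final, possibly undersized chunk
    }
    outputs
}
```

A chunk can overshoot the target by at most one entry. Concatenating the chunks in order gives back the original stream, so the per-file key ranges never overlap.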

What this lab deliberately leaves un-clean for later

  • No async I/O. The merge is CPU-bound on materialized vectors.
  • No CRCs on blocks. Bad bytes in an input produce corrupt output silently.
  • No fsync / atomic rename. The CLI writes the output and the script renames.
  • No metrics. Production engines export bytes-read, bytes-written, files-in, files-out, duration per compaction.

These are intentional. The point of this lab is the merge, not the operational surface.

db-07 Step 1 — K-way merge core

The whole lab is one algorithm. We build it in three languages, then expose it through a tiny CLI.

Cursor

A cursor is an iterator over (key, entry) pairs from one input SSTable, in key order. The simplest representation: materialize via SstReader::entries() and index into the resulting vector.

Rust:

struct Cursor {
    items: Vec<(Vec<u8>, Entry)>,
    pos: usize,
}
impl Cursor {
    // Current key without consuming it; None once the cursor is exhausted.
    fn peek(&self) -> Option<&[u8]> { self.items.get(self.pos).map(|(k, _)| k.as_slice()) }
    // Consume the current item. Cloning assumes Entry: Clone; call peek() first to check for exhaustion.
    fn take(&mut self) -> (Vec<u8>, Entry) { let i = self.pos; self.pos += 1; self.items[i].clone() }
}

Go:

type cursor struct {
    items []entry // entry = {Key []byte; E sstable.Entry}
    pos   int
}
// peek returns nil when exhausted, so an empty key must be a non-nil empty slice.
func (c *cursor) peek() []byte { if c.pos >= len(c.items) { return nil }; return c.items[c.pos].Key }
func (c *cursor) take() entry  { x := c.items[c.pos]; c.pos++; return x }

C++:

struct Cursor {
    std::vector<std::pair<std::vector<uint8_t>, sstable::Entry>> items;
    std::size_t pos = 0;
    // nullptr once the cursor is exhausted.
    const std::vector<uint8_t>* peek() const {
        return pos < items.size() ? &items[pos].first : nullptr;
    }
};

Heap entry

(key bytes, source_index)

Min-heap ordered by key ascending, ties broken by source_index ascending (smaller index = newer = wins). All three implementations must use this exact ordering for byte-identity.

Emit loop

Pseudocode is in docs/execution.md. The crucial bit is the inner drain loop:

# After emitting (k, entry):
while heap not empty and heap.peek().key == k:
    (_, j) = heap.pop()
    cursors[j].take()  # discard the older duplicate
    if cursors[j].peek() is not None:
        heap.push((cursors[j].peek(), j))

That loop is the only difference between "K-way merge of disjoint inputs" and "K-way merge with newest-wins dedupe".

Why the tiebreaker direction matters

Inputs are passed newest first (index 0 newest). On a tie, the smaller index must come out of the heap first. So when you build a (key, src) tuple, the smaller src is the smaller tuple, and a min-heap pops it first. No need to invert; the natural lexicographic order on the tuple does the right thing.

If you ever flip the input convention (oldest first), invert the tiebreaker. Do not do both — pick one and document it. We picked: index 0 = newest.
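A quick way to convince yourself of the pop order is to push a few (key, src) pairs and drain the heap. The sketch below uses Rust's std BinaryHeap, which is a max-heap, so the whole tuple is wrapped in Reverse; that flips key order and source order together and leaves their relative roles unchanged. The helper name pop_order is hypothetical.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Pop order of (key, src) pairs from a min-heap built with `Reverse`.
fn pop_order(items: Vec<(Vec<u8>, usize)>) -> Vec<(Vec<u8>, usize)> {
    let mut heap: BinaryHeap<Reverse<(Vec<u8>, usize)>> =
        items.into_iter().map(Reverse).collect();
    let mut order = Vec::new();
    while let Some(Reverse(item)) = heap.pop() {
        order.push(item);
    }
    order
}
```

Pushing ("b", 0), ("a", 1), ("a", 0) in that order pops ("a", 0), ("a", 1), ("b", 0): the smaller key first, and on equal keys the smaller source index (the newer input) first.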

Try this before reading step 2

Without looking at the implementation, on paper, trace this:

A: [("a",V,"1"), ("c",V,"3")]
B: [("a",V,"old"), ("b",V,"2")]

Write out the heap after each pop. You should get four pops and three emits.

The expected output: [("a",V,"1"), ("b",V,"2"), ("c",V,"3")].

db-07 Step 2 — Tombstone purging and the bottom-level rule

A tombstone in an SSTable says: "this key was deleted; do not return any older version of it". Tombstones cost space and slow down reads (you still have to walk past them). Eventually you want to drop them.

The rule

A tombstone for key k is safe to drop during a compaction if and only if there is no older version of k anywhere in the database that the tombstone could be hiding.

In this lab's terms: if this compaction is over the bottom-most level and the tombstone's input is part of it, you can drop the tombstone.

For non-bottom compactions, keep all tombstones. A deeper level still has data the tombstone is suppressing.

API

compact(inputs, drop_tombstones: bool) -> bytes

The flag is the caller's promise. We do not verify it; we trust it. In a real engine, the scheduler sets drop_tombstones = (target_level == bottom).

Implementation: one if-statement

In the emit loop, after picking the winner (k, entry):

if entry.type == Tombstone and drop_tombstones:
    continue   # skip; do not write to output
out.add(k, entry)

That's the entire change versus the basic merge.

What's still wrong (and why it's OK for this lab)

The "drop tombstones at bottom" rule is a snapshot-unaware simplification. A correct engine keeps a tombstone alive until every read snapshot older than the tombstone's sequence number has been released. Implementing that requires per-entry sequence numbers, which we add in db-13.

For this lab, "bottom" means "the caller swears nothing older exists". That is enough to demonstrate the mechanism and to write a meaningful cross-test.

Test scenarios this enables

| Test | Inputs (newest first) | drop | Expected output |
| --- | --- | --- | --- |
| Tombstone wins over older value | A=[(k,T)], B=[(k,V,"x")] | false | [(k,T)] |
| Tombstone dropped at bottom | same | true | [] (empty SSTable, 36 bytes) |
| Tombstone for non-existent key kept | A=[(k,T)] | false | [(k,T)] |
| Tombstone for non-existent dropped | same | true | [] |
| Mixed values + tombstones, mid-level | A=[(a,V),(b,T)], B=[(a,V_old),(c,V)] | false | [(a,A.V),(b,T),(c,V)] |
| Same inputs at bottom | same | true | [(a,A.V),(c,V)] |

These are V4 in the verification table and the drop_tombstones=true arm of the cross-test.

db-07 Step 3 — CLI and cross-language byte-identity

CLI shape (all three languages emit and accept the same)

compact [--drop-tombstones] OUT.sst IN1.sst IN2.sst ...

Arguments:

  • --drop-tombstones: optional first flag. If present, tombstones are dropped (use when this is the bottom-level compaction).
  • OUT.sst: output file path.
  • IN1.sst ...: one or more input SSTable paths. IN1 is the newest.

Exit codes:

  • 0: success.
  • 1: any error (open failure, malformed SSTable, write failure).
  • 2: usage error.

The CLI is intentionally minimal. There is no JSON, no stats, no progress. Stats live in db-22 (performance + benchmarking).

The cross-test scenario

The script in scripts/cross_test.sh:

  1. Builds feed_newer.mt (memtable scenario from observation.md, 50 keys with key10 replaced and key5 deleted).
  2. Builds feed_older.mt (100 keys with key50 = "OLD-50").
  3. Promotes both to SSTables using the db-06 Rust binary (sstable build feed_newer.mt newer.sst).
  4. For each language, runs compact OUT.sst newer.sst older.sst.
  5. Asserts sha256(rust.OUT) == sha256(go.OUT) == sha256(cpp.OUT).
  6. Runs the 3×3 read matrix using db-06's sstable iter over each OUT.
  7. Spot-checks sstable get OUT.sst <key> for key5, key10, key50, key99, nope.

The spot-checks use db-06's sstable CLI (not db-07's compact), which is why steps 6–7 don't need a separate db-07 reader: the output is a db-06 SSTable.

Why this proves the merge

The scenario uses two SSTables with overlapping keys: on every overlap the newer version must win, and key50..key99 exist only in the older file. If your merge logic gets the recency tiebreaker wrong, you read val10 instead of NEW-10. If you forget to drain duplicates, you write the same key twice and SstWriter::add returns an error. If you drop tombstones by mistake, key5 reads as absent instead of tombstone.

If all three languages get the same sha256, the algorithm and its translation to three runtimes are pinned down.

db-08 — Block Cache and Iterators

What is it?

Two small, foundational read-path components that every LSM (and most B-tree engines) need:

  1. Block cache — a bounded, in-memory map from (file_id, block_offset) to the decoded block bytes, evicting the least-recently-used entry when full. Sits between the SSTable reader and the OS page cache so that a hot index block or hot data block does not have to be decoded on every query.
  2. Merging iterator — a streaming K-way merge over N pre-sorted sources (memtable, level-0 SSTables, level-1 SSTables, …) that yields each key exactly once, preferring the newest source on ties, and optionally drops tombstones. This is the engine of every LSM read: point lookups, range scans, compaction, and snapshot iteration.

Why does it matter?

In an LSM, a single user get("k") may have to consult the memtable plus 1–10 SSTables. Without a cache, every miss re-reads (and re-checksums, and re-decodes) blocks from disk; without a merging iterator, range scans cannot present a single ordered view of the live keyspace. Together these two components turn the LSM's "many small sorted runs" representation into the illusion of "one big sorted map" — and they do it without unbounded memory.

These primitives also appear far outside databases:

  • OS page cache is a block cache for files.
  • CPU L1/L2/L3 are hardware block caches keyed on physical address.
  • sort -m and most stream-join operators are merging iterators.
  • Kafka log compaction, Bigtable scans, and DynamoDB streams all do tournament-style merges across sorted inputs.

How does it work?

            ┌─────────────── BlockCache (cap = N entries) ───────────────┐
get(k)  ──► │  HashMap<(file_id, off), Node*>  +  DoublyLinkedList<Node> │
            │  hit:  splice node to front, return value                  │
            │  miss: insert at front; if full, drop the back node        │
            └────────────────────────────────────────────────────────────┘
                              │
                              ▼
            ┌─────────────── MergingIterator(sources) ───────────────────┐
            │  min-heap of (current_key, src_idx)                        │
            │  Next():                                                   │
            │    pop heap → winner                                       │
            │    advance winner src, push its next key (if any)          │
            │    while heap.top().key == winner.key:                     │
            │       pop & advance older (they are shadowed by winner)    │
            │    if drop_tombstones and winner is tombstone: continue    │
            │    yield (winner.key, winner.entry)                        │
            └────────────────────────────────────────────────────────────┘

Two invariants make this correct:

  • Per-source sort. Within one source, keys are strictly ascending. The heap therefore needs only the front of each source — never the full set.
  • Tie-break by source index. Source 0 is newest; on a tie, the newest entry wins and the older copies are drained without being yielded. This is how a put in the memtable shadows an old value in L1, and how a tombstone shadows a value of the same key in any older source.
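The two invariants are enough to write a streaming merge that holds only the front of each source. The Rust sketch below is one possible shape, under assumptions: entries are modeled as (key, Option<value>) with None standing for a tombstone, and the type names Kv and MergingIterator are illustrative rather than the lab's actual API.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::iter::Peekable;

/// (key, value); `None` models a tombstone (assumed encoding).
type Kv = (Vec<u8>, Option<Vec<u8>>);

/// Streaming newest-wins merge. Source 0 is the newest; every source must
/// yield strictly ascending keys.
struct MergingIterator<I: Iterator<Item = Kv>> {
    sources: Vec<Peekable<I>>,
    heap: BinaryHeap<Reverse<(Vec<u8>, usize)>>, // min-heap of (front key, src)
    drop_tombstones: bool,
}

impl<I: Iterator<Item = Kv>> MergingIterator<I> {
    fn new(inputs: Vec<I>, drop_tombstones: bool) -> Self {
        let mut sources: Vec<Peekable<I>> =
            inputs.into_iter().map(|s| s.peekable()).collect();
        let mut heap = BinaryHeap::new();
        for (src, source) in sources.iter_mut().enumerate() {
            if let Some((key, _)) = source.peek() {
                heap.push(Reverse((key.clone(), src)));
            }
        }
        MergingIterator { sources, heap, drop_tombstones }
    }

    /// Take the front item of `src` and re-push that source's next key.
    fn advance(&mut self, src: usize) -> Option<Kv> {
        let item = self.sources[src].next();
        if let Some((key, _)) = self.sources[src].peek() {
            self.heap.push(Reverse((key.clone(), src)));
        }
        item
    }
}

impl<I: Iterator<Item = Kv>> Iterator for MergingIterator<I> {
    type Item = Kv;

    fn next(&mut self) -> Option<Kv> {
        while let Some(Reverse((key, src))) = self.heap.pop() {
            let winner = self.advance(src)?;
            // Older copies of the same key are shadowed: drain, do not yield.
            while self.heap.peek().map_or(false, |Reverse((k, _))| *k == key) {
                let Reverse((_, older)) = self.heap.pop().unwrap();
                self.advance(older); // shadowed entry is discarded
            }
            if self.drop_tombstones && winner.1.is_none() {
                continue;
            }
            return Some(winner);
        }
        None
    }
}
```

The heap never holds more than one key per source, and each Peekable buffers at most one item, so memory stays proportional to the number of sources regardless of how many entries flow through.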

Terminology

  • Block — a fixed-ish-size chunk of an SSTable (typically 4 KiB) that is the unit of disk I/O and the unit of block-cache eviction.
  • Cache hit / miss — was the requested block, identified by (file_id, block_offset), already in the cache?
  • Eviction — removing an entry to make room. LRU picks the entry least-recently touched (read or written).
  • MRU / LRU — most/least recently used end of the list.
  • K-way merge — merging K already-sorted sequences into one sorted sequence. Optimal comparison cost is O(N log K) for N total entries.
  • Tournament tree / min-heap — the data structure used to pick the next source to advance in O(log K).
  • Tombstone — a marker that says "this key has been deleted"; it shadows any older value for the same key until it is dropped during compaction.
  • Newest-wins — the LSM tie-break rule: source i < j means i is newer.

Mental models

  • The cache is a bounded hash map with a freshness order. The hash gives you O(1) lookup, the list gives you O(1) eviction of the stalest entry.
  • The merging iterator is a tournament. K runners, each in their own lane; the heap is the leaderboard; every Next() advances the current leader by one step and re-runs the comparison between the new front of that lane and the rest of the heap.
  • Tombstones are entries, not absences. They occupy a slot in the stream and only disappear during a compaction that is guaranteed to have seen all older versions of the same key.
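The first mental model (a bounded hash map plus a freshness order) can be sketched in safe Rust by storing the doubly-linked list as indices into a Vec instead of raw pointers. This is an illustrative, single-threaded, entry-bounded version; the names BlockCache, BlockKey, and NIL are assumptions, and a production cache would bound by bytes and shard for concurrency.

```rust
use std::collections::HashMap;

type BlockKey = (u64, u64); // (file_id, block_offset)
const NIL: usize = usize::MAX; // "no neighbour" marker for list links

struct Node {
    key: BlockKey,
    value: Vec<u8>,
    prev: usize,
    next: usize,
}

/// Entry-bounded LRU: HashMap for O(1) lookup, index-linked list for order.
struct BlockCache {
    cap: usize,
    map: HashMap<BlockKey, usize>, // key -> slot in `nodes`
    nodes: Vec<Node>,
    head: usize, // most recently used
    tail: usize, // least recently used
}

impl BlockCache {
    fn new(cap: usize) -> Self {
        assert!(cap > 0);
        BlockCache { cap, map: HashMap::new(), nodes: Vec::new(), head: NIL, tail: NIL }
    }

    /// Detach slot `i` from the list, patching its neighbours.
    fn unlink(&mut self, i: usize) {
        let (prev, next) = (self.nodes[i].prev, self.nodes[i].next);
        if prev != NIL { self.nodes[prev].next = next } else { self.head = next }
        if next != NIL { self.nodes[next].prev = prev } else { self.tail = prev }
    }

    /// Attach slot `i` at the MRU end.
    fn push_front(&mut self, i: usize) {
        self.nodes[i].prev = NIL;
        self.nodes[i].next = self.head;
        if self.head != NIL { self.nodes[self.head].prev = i } else { self.tail = i }
        self.head = i;
    }

    /// Hit: promote to MRU and return the block. Miss: None.
    fn get(&mut self, key: BlockKey) -> Option<&[u8]> {
        let i = *self.map.get(&key)?;
        self.unlink(i);
        self.push_front(i);
        Some(self.nodes[i].value.as_slice())
    }

    /// Insert or overwrite; evicts the LRU entry when the cache is full.
    fn insert(&mut self, key: BlockKey, value: Vec<u8>) {
        if let Some(&i) = self.map.get(&key) {
            self.nodes[i].value = value;
            self.unlink(i);
            self.push_front(i);
            return;
        }
        let slot = if self.map.len() == self.cap {
            let victim = self.tail; // reuse the evicted node's slot
            self.unlink(victim);
            self.map.remove(&self.nodes[victim].key);
            self.nodes[victim] = Node { key, value, prev: NIL, next: NIL };
            victim
        } else {
            self.nodes.push(Node { key, value, prev: NIL, next: NIL });
            self.nodes.len() - 1
        };
        self.map.insert(key, slot);
        self.push_front(slot);
    }
}
```

With capacity 2, inserting blocks (1, 0) and (1, 4096), reading (1, 0), and then inserting (2, 0) evicts (1, 4096): the read promoted (1, 0), leaving (1, 4096) as the least recently used entry.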

Common misconceptions

  • "LRU is just a list." No — a list alone is O(N) per lookup. The hash map is what makes both operations O(1); the list only encodes the order.
  • "A merging iterator deduplicates by buffering everything." No. It inspects only the front of each source. Total memory is O(K), regardless of how many entries flow through.
  • "Newest-wins requires timestamps." Not in an LSM: source ordering already encodes recency (memtable > L0 > L1 > …). Timestamps are a separate concern for MVCC (db-13).
  • "A block cache replaces the OS page cache." It complements it. The OS caches raw file bytes; the block cache caches decoded/decompressed blocks and shortcuts the verification step (CRC checks, decompression).
  • "Tombstones can be dropped any time." Only during a compaction that includes the bottom level — otherwise an older live value could re-surface. See db-07 for the rule; db-08 lets the caller decide via a flag.

Talking points (interview-grade)

  • Why bound the cache by entries vs bytes? Entry-bounded is simpler and fine when blocks are roughly uniform (e.g., RocksDB's default 4 KiB blocks). Production systems bound by bytes (block_cache_size_mb) because compressed block sizes vary widely; we use entry count here to keep the data structure the focus of the lab.
  • Why a doubly-linked list and not a VecDeque? O(1) removal of an arbitrary node on hit-promotion. VecDeque only gives O(1) at the ends.
  • Why heap of (key, src) and not heap of full entries? Comparator cost: keys are small and comparable; entries (which may hold large values) are not. Also lets us move the entry out of the source vector with a single std::move / mem::take, avoiding copies.
  • Why does newest-wins also drain all older entries with that key? Otherwise the iterator would emit duplicates downstream, breaking the "exactly one entry per live key" contract that compaction and range scans depend on.
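  • What about thread safety? Our block cache is single-threaded by design (in Rust, get takes &mut self, so there is no concurrent access to protect). Real systems shard the cache by key hash (LevelDB uses 16 shards; RocksDB derives the shard count from the capacity, up to 64 by default) so each shard has its own mutex and, for a uniform key distribution, roughly 1/N of the contention.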
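The entries-vs-bytes talking point can be made concrete with a small sketch. It models only the accounting side of a byte-bounded cache: each admitted block carries a size, and admission evicts from the LRU end in a loop until the new block fits. The names ByteBudget and admit are hypothetical and not part of the lab's code, and recency promotion on hit is deliberately left out.

```rust
use std::collections::VecDeque;

/// Hypothetical byte-bounded admission model (not the lab's BlockCache).
/// Front of `order` is the LRU end, back is the MRU end.
pub struct ByteBudget {
    cap_bytes: usize,
    used_bytes: usize,
    order: VecDeque<(u64, usize)>, // (block id, size in bytes)
}

impl ByteBudget {
    pub fn new(cap_bytes: usize) -> Self {
        ByteBudget { cap_bytes, used_bytes: 0, order: VecDeque::new() }
    }

    /// Admits a block and returns the ids evicted to make room for it.
    pub fn admit(&mut self, id: u64, size: usize) -> Vec<u64> {
        let mut evicted = Vec::new();
        // Unlike an entry-bounded cache, one insert may evict several blocks.
        while self.used_bytes + size > self.cap_bytes {
            match self.order.pop_front() {
                Some((old_id, old_size)) => {
                    self.used_bytes -= old_size;
                    evicted.push(old_id);
                }
                // Assumption: a block larger than the whole budget is still admitted.
                None => break,
            }
        }
        self.order.push_back((id, size));
        self.used_bytes += size;
        evicted
    }

    pub fn used_bytes(&self) -> usize {
        self.used_bytes
    }
}
```

With a 100-byte budget, admitting blocks of 40, 40, and 50 bytes evicts only the first block, while a following 100-byte block evicts both survivors in a single call.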

Connections to the rest of the curriculum

  • db-05 (memtable) is the newest source in every read-path merge.
  • db-06 (SSTable format) produces the sorted entries the merger consumes, and the blocks the cache caches.
  • db-07 (compaction) is itself a merging iterator with drop_tombstones=true whose output is fed to an SSTable writer. db-08 generalizes that machinery so the read path can use it for point lookups and scans as well.
  • db-09 (LevelDB-complete) wires this into a full Get/Scan path.
  • db-13 (MVCC) layers per-key snapshot filtering on top of a merging iterator like this one.

db-08 — References

Block cache / LRU

  • O'Neil, O'Neil, Weikum — "The LRU-K Page Replacement Algorithm For Database Disk Buffering" (SIGMOD 1993). The canonical "LRU is not the whole story" paper; explains why LRU under-performs LRU-K on database workloads.
  • LevelDB util/cache.cc: the reference LRU used by LevelDB. Each shard is a doubly-linked list + hash table, and reads update recency on hit; ShardedLRUCache spreads keys across 16 such shards. Worth reading end-to-end; it is only a few hundred lines. https://github.com/google/leveldb/blob/main/util/cache.cc
  • RocksDB cache/lru_cache.{h,cc} and cache/clock_cache.{h,cc} — production-grade sharded LRU plus a clock-based variant. Demonstrates the shard-by-key-hash technique. https://github.com/facebook/rocksdb/tree/main/cache
  • CockroachDB Pebble internal/cache/ — a Go implementation with a modern API; useful for comparing language ergonomics. https://github.com/cockroachdb/pebble/tree/master/internal/cache
  • Postgres src/backend/storage/buffer/ — the canonical relational buffer pool: clock-sweep replacement with usage counts. Different policy, same role.

Merging iterators

  • Knuth, TAOCP Vol. 3 §5.4.1 — "Multiway Merging and Replacement Selection". The original analysis of K-way merge using a tournament tree and a loser tree.
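  • LevelDB table/merger.cc and include/leveldb/iterator.h: the canonical read-path iterator interface, plus the merger that combines memtable + level-0 + level-N+ iterators. Note that LevelDB's merger finds the smallest child with a linear scan rather than a heap, which is adequate for small K; RocksDB's merging iterator is the heap-based one. https://github.com/google/leveldb/blob/main/table/merger.cc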
  • RocksDB table/merging_iterator.{h,cc} — extended with range tombstones and pinned iterators. Shows how the interface evolves under production pressure.
  • Pebble internal/manifest/level_iter.go + merging_iter.go — a Go flavor with explicit handling of range deletes.

Background reading

  • Designing Data-Intensive Applications, Ch. 3 ("Storage and Retrieval"), pp. 70-89. Kleppmann's tour of LSM read amplification, bloom filters, and the role of the block cache.
  • Petrov, "Database Internals", Ch. 7 ("Log-Structured Storage"). Covers caching, iterators, and tombstone semantics at the level we implement.

Lab-specific notes

  • The canonical byte layout used by merge_iter is documented in src/rust/src/lib.rs on SerializeStream. It is deliberately minimal — its only job is to give us a byte-identical cross-language fingerprint for the sha256 check.
  • The cross-test reuses the same newer.mt/older.mt feedstock as db-07 so the two labs can be compared side-by-side. Their sha256s will differ because db-07 emits a full SSTable (with index, footer, padding) while db-08 emits a flat entry-stream, but the underlying ordering is identical.

db-08 — Analysis

Problem statement

We need two read-path primitives that the rest of the LSM stack assumes exist:

  1. A bounded in-memory cache that lets us amortize the cost of decoding SSTable blocks across many lookups, with predictable memory usage and O(1) operations.
  2. A streaming K-way merging iterator that exposes K pre-sorted sources as a single sorted, deduplicated stream (newest-wins on tie) without buffering all entries in memory.

Both must be small, dependency-light, and byte-deterministic when serialized (so the cross-language cross-test can detect any divergence).

Constraints

  • Determinism. Given identical inputs, the merge stream's serialized bytes must be identical across Rust, Go, and C++. This is the cross-test's only gate.
  • Bounded memory. The cache must cap at a user-supplied entry count; the iterator must use O(K) working set regardless of the number of entries.
  • No backtracking. The iterator is streaming: it must work on inputs that arrive lazily.
  • Newest-wins is strict. Source index 0 always wins. There are no timestamps, generations, or sequence numbers — that complexity is deferred to db-13 (MVCC).

Decisions

  • Cache eviction policy: LRU. Simple, predictable, well-understood. Not the best policy for all workloads (LRU-K, ARC, and CLOCK-Pro all beat it on scan-heavy workloads), but the correct teaching baseline.
  • Cache capacity unit: entries. Production systems bound by bytes; we use entries to keep the data structure (rather than the accounting) the focus.
  • Heap element shape: (key, source_index). Small and cheap to compare. Pulling the full entry into the heap would inflate comparator cost and force copies.
  • No timestamps / sequence numbers. Newest-wins is by source index alone.
  • Tombstone drop is opt-in. Callers pass drop_tombstones=true only when they have proven (via compaction rules — see db-07) that no older source could resurrect the deleted key.

Trade-offs

| Choice | Pros | Cons |
|---|---|---|
| LRU (vs LRU-K, ARC, CLOCK) | O(1) ops, simple to reason about | Scan-pollutes: one big scan can flush hot entries |
| Doubly-linked list (vs VecDeque) | O(1) arbitrary removal | Heavier per-node memory (two pointers) |
| Heap of (key, src) (vs entry) | Cheap compares, no copies | Indirection back to source vector on every pop |
| Entry-bounded cap (vs byte) | Simple, no per-block sizing | Memory usage depends on block-size distribution |
| Drain-on-tie eagerly | Caller never sees duplicates | Slight extra work even when caller would dedupe |

Risks

  • Heap ordering bug on tie. If the (key, src) comparator forgets to break ties on src ascending, the merger silently emits the older value on key collisions. The "newest-wins" test catches this on a 2-entry input.
  • Cache eviction at boundary. Inserting into a full cache and then immediately calling Get on the just-evicted key must miss, not hit.
  • Iterator reentrancy. Calling Next after end-of-stream must keep returning end-of-stream, not panic.
  • Cross-language drift on serialization. Endianness or length-prefix width mismatches would invalidate the sha256. We pin to "u32 LE length + bytes + u8 type [+ u32 LE val_len + val]".

Out of scope

  • Compression (RocksDB caches decompressed blocks; some configs cache both).
  • Pin/unpin handle protocol for zero-copy reads.
  • Snapshot/sequence-number-aware iteration (deferred to db-13).
  • Range deletes / range tombstones (deferred to db-21).
  • Block-cache statistics beyond hit/miss/evict.

db-08 — Execution

Build order

  1. Rust first: drives the canonical data-shape decisions (the Entry enum from db-06, the byte format of SerializeStream).
  2. Go second: ports the same algorithm with native data structures (container/list, container/heap).
  3. C++ third: same algorithm with std::list + std::priority_queue.

After all three pass their own unit tests, we run scripts/cross_test.sh which builds canonical input SSTables via db-05 + db-06 and asserts that all three merge_iter binaries produce the same sha256.

Per-language layout

Rust (src/rust)

  • Cargo.toml pulls in db-06's sstable crate by path = "../../../db-06-sstable-format/src/rust".
  • src/lib.rs defines BlockCache, MergingIterator, SerializeStream, and re-exports sstable::Entry. The cache uses a HashMap of slot indices plus a Vec<Node> arena with embedded prev/next indices and a free-list — an arena-based intrusive list, which beats LinkedList<T> on allocator pressure.
  • src/bin/merge_iter.rs is the cross-test CLI.

Go (src/go)

  • go.mod is module github.com/10xdev/dse/db08 with replace directives pointing to db-05 and db-06 on disk.
  • lru.go uses container/list.List and map[BlockKey]*list.Element.
  • iter.go uses container/heap with a []heapItem backing slice.
  • cmd/merge_iter/main.go is the CLI.

C++ (src/cpp)

  • CMakeLists.txt directly compiles db-06's sstable.cc into our sstable_lib rather than add_subdirectorying db-06 — that would leak db-06's add_test registrations into our ctest.
  • lru.{h,cc} uses std::list<Node> + std::unordered_map; Get uses list_.splice(list_.begin(), list_, it->second) for O(1) MRU promotion.
  • iter.{h,cc} uses std::priority_queue<HeapEntry, std::vector<HeapEntry>, Greater>.
  • src/merge_iter_bin.cc is the CLI.

Verification

  • scripts/verify.sh builds + tests all three languages.
  • scripts/cross_test.sh builds db-05 + db-06 input pipelines, generates the same newer.sst / older.sst used by db-07, runs merge_iter in all three languages, and asserts sha256 byte-identity for both drop_tombstones=false and drop_tombstones=true. It also spot-checks that "NEW-10", "OLD-50", "val99" appear in the stream and that the key5 tombstone framing (040000006b65793501) appears exactly when expected.

Reproducible cross-test sha256 (this lab's truth)

drop=false  → f693c483ef39dfef8e6285e29f9051a57e60bf2c4ba7b45bbf552c7932687fd1 (1874 bytes)
drop=true   → ec71c56c89f451d33e58697af2d7bce985069078e1c599cc42062dfbba6e250e (1865 bytes)

The 9-byte difference is exactly the framing of one tombstone entry: u32_le(4) + "key5" + u8(1) = 4 + 4 + 1 = 9 bytes.

What you should be able to do after this lab

  • Sketch an LRU on a whiteboard in under three minutes and explain why both the hash map and the list are necessary.
  • Explain why a K-way merge uses a heap and not nested merges, and quote the O(N log K) comparison bound.
  • Identify, in any storage codebase, where the "newest-wins on tie" rule is enforced and where the "drain duplicates" step happens.
  • Argue when it is safe to drop a tombstone during iteration vs when it is not.

db-08 — Observation

What we measured (functional)

  • 11 Rust unit tests pass (cargo test): 4 LRU + 6 merger + 1 serializer.
  • 11 Go unit tests pass (go test ./...): 4 LRU + 6 merger + 1 serializer.
  • 2 C++ ctest binaries pass (test_lru covers 4 cases, test_iter covers 7).
  • Cross-language sha256 match in both modes (see execution.md).

Anatomy of the output stream

For the canonical input (newer = bulk 50 + put key10=NEW-10 + del key5; older = bulk 100 + put key50=OLD-50) the entry count is 100 with drop=false and 99 with drop=true. Total byte-count for the output stream:

  • drop=false: 1874 bytes
  • drop=true : 1865 bytes (delta 9 = exactly one tombstone frame)

Each value entry costs 4 (key_len) + len(key) + 1 (type) + 4 (val_len) + len(val) bytes. For our scenario, most keys are keyN (4-5 bytes) with values valN (4-5 bytes), making the per-entry frame 17-19 bytes.
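The byte counts above can be re-derived from the frame formula alone. The sketch below assumes the bulk data is keys key0 through key99 with values val0 through val99, overridden by key10=NEW-10, key50=OLD-50, and a tombstone on key5; that naming is an assumption inferred from the spot-check strings, and the function names are hypothetical.

```rust
/// Size of one serialized value frame: u32 key_len + key + u8 type + u32 val_len + val.
pub fn value_frame_len(key: &str, val: &str) -> usize {
    4 + key.len() + 1 + 4 + val.len()
}

/// Size of one serialized tombstone frame: u32 key_len + key + u8 type.
pub fn tombstone_frame_len(key: &str) -> usize {
    4 + key.len() + 1
}

/// Expected size of the merged stream for the canonical cross-test input.
/// Assumption: bulk keys are key0..key99 with values val0..val99.
pub fn expected_stream_len(drop_tombstones: bool) -> usize {
    let mut total = 0;
    for n in 0..100u32 {
        let key = format!("key{}", n);
        total += match n {
            // key5 is deleted in the newer source, so it is a tombstone frame or nothing.
            5 => {
                if drop_tombstones {
                    0
                } else {
                    tombstone_frame_len(&key)
                }
            }
            10 => value_frame_len(&key, "NEW-10"),
            50 => value_frame_len(&key, "OLD-50"),
            _ => value_frame_len(&key, &format!("val{}", n)),
        };
    }
    total
}
```

Under those assumptions the totals come out to 1874 bytes with tombstones kept and 1865 with them dropped, matching the recorded sizes.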

Hit/miss behavior under repeated workloads

The lru_basic_hit_miss test demonstrates the basic counters: one Get on a present key bumps hits to 1; one Get on an absent key bumps misses to 1. The lru_evicts_lru_on_capacity test confirms that the eviction counter increments exactly once when a fourth insert into a 3-slot cache forces the LRU node out.

Tournament dynamics

With K = 2 sources in the cross-test, the heap has at most 2 entries; with K = 7 (memtable + L0 file + 5 L1 files in a realistic LSM), the heap has at most 7 entries regardless of the millions of entries flowing through. Heap operations are O(log K) per Next(), so even at K = 1000 the per-entry cost is ~10 comparisons.
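As a back-of-envelope aid for the figures above, ceil(log2 K) can be computed with integer bit tricks. This is only the usual rough estimate of heap levels touched per Next(); real comparison counts differ by a small constant factor. The function name heap_levels is hypothetical.

```rust
/// ceil(log2(k)) for k >= 1, computed without floating point.
pub fn heap_levels(k: u64) -> u32 {
    assert!(k >= 1, "a merge needs at least one source");
    // For k >= 2 this is the bit-width of (k - 1); for k == 1 it is 0.
    u64::BITS - (k - 1).leading_zeros()
}
```

For the K values discussed here this gives 1 level for K = 2, 3 for K = 7, and 10 for K = 1000.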

Determinism

The serialize_is_deterministic_and_sized test in all three languages constructs the same (key, entry) stream twice and confirms identical serialized bytes. This is what the cross-test relies on — if any language becomes non-deterministic (e.g., picks the wrong duplicate on a tie, or serializes value lengths in big-endian), the sha256 mismatch will surface immediately.

What surprised me

  • The C++ std::priority_queue is a max-heap by default; it becomes a min-heap only if you pass an explicit std::greater-style comparator. Forgetting this emits keys in reverse order.
  • Rust's BinaryHeap is max-heap-by-default; we wrap in Reverse((key, src)) to flip it, which also automatically gives the correct tie-break on src ascending because Reverse(a) < Reverse(b) iff a > b and the derived tuple Ord compares lexicographically.
  • Go's container/heap requires you to write Less yourself, so the tie-break is explicit and self-documenting.
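The Rust tie-break behaviour is easy to check in isolation. This sketch pushes (key, src) pairs into a BinaryHeap wrapped in Reverse and records the pop order; the helper name pop_order is hypothetical and exists only for this illustration.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Pops every (key, src) pair from a min-heap built with `Reverse`
/// and returns the pairs in pop order.
pub fn pop_order(items: Vec<(Vec<u8>, usize)>) -> Vec<(Vec<u8>, usize)> {
    // Reverse flips BinaryHeap's max-heap ordering into a min-heap.
    let mut heap: BinaryHeap<Reverse<(Vec<u8>, usize)>> =
        items.into_iter().map(Reverse).collect();
    let mut out = Vec::new();
    // Tuples compare lexicographically: key first, then source index.
    while let Some(Reverse(pair)) = heap.pop() {
        out.push(pair);
    }
    out
}
```

Feeding it ("b", 0), ("a", 1), ("a", 0) yields ("a", 0), ("a", 1), ("b", 0): the smallest key comes first, and on equal keys the smaller (newer) source index comes first.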

What did not surprise me

The hit/miss counts came out exactly as expected on first run for all three languages. The K-way merge produced a sorted stream on first run for Rust and Go.

db-08 — Verification

Cross-language byte identity (gating)

scripts/cross_test.sh is the gate. It builds canonical SSTable inputs and runs each language's merge_iter binary, comparing sha256 of the serialized merge stream.

Final results from the current run:

drop=false:
  rust: f693c483ef39dfef8e6285e29f9051a57e60bf2c4ba7b45bbf552c7932687fd1 (1874 bytes)
  go  : f693c483ef39dfef8e6285e29f9051a57e60bf2c4ba7b45bbf552c7932687fd1 (1874 bytes)
  cpp : f693c483ef39dfef8e6285e29f9051a57e60bf2c4ba7b45bbf552c7932687fd1 (1874 bytes)
  match: f693c483ef39dfef8e6285e29f9051a57e60bf2c4ba7b45bbf552c7932687fd1

drop=true:
  rust: ec71c56c89f451d33e58697af2d7bce985069078e1c599cc42062dfbba6e250e (1865 bytes)
  go  : ec71c56c89f451d33e58697af2d7bce985069078e1c599cc42062dfbba6e250e (1865 bytes)
  cpp : ec71c56c89f451d33e58697af2d7bce985069078e1c599cc42062dfbba6e250e (1865 bytes)
  match: ec71c56c89f451d33e58697af2d7bce985069078e1c599cc42062dfbba6e250e

The 9-byte size delta between modes equals exactly one tombstone frame (u32_le(4) + "key5" + u8(1)), confirming that the only entry dropped is the expected one.

Stream-content spot-checks

The cross-test runs xxd -p | grep to confirm that:

  • NEW-10 (hex 4e45572d3130) appears — the merged-write semantics worked.
  • OLD-50 (hex 4f4c442d3530) appears — keys present only in the older source survive.
  • val99 (hex 76616c3939) appears — the largest bulk key from older shows up.
  • 040000006b65793501 (key5 tombstone framing) appears with drop=false and is absent with drop=true.

These are not redundant with the sha256 check: sha256 mismatch tells you something is wrong but not what; the framed-hex grep tells you which invariant broke.
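The hex strings in the spot-checks can be reproduced without xxd. The sketch below is an illustration only (the cross-test itself uses shell tools); to_hex and contains_bytes are hypothetical helper names.

```rust
/// Lower-case hex encoding, two digits per byte (the same digits `xxd -p` prints).
pub fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// True if `needle` occurs as a contiguous byte run inside `stream`.
pub fn contains_bytes(stream: &[u8], needle: &[u8]) -> bool {
    // `windows(0)` panics, so an empty needle is rejected explicitly.
    !needle.is_empty() && stream.windows(needle.len()).any(|w| w == needle)
}
```

Searching on raw bytes rather than on hex text avoids false matches that start in the middle of a byte.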

Unit-test coverage matrix

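| Behavior | Rust | Go | C++ |
|---|---|---|---|
| LRU basic hit/miss + counters | ✓ | ✓ | ✓ |
| LRU evicts LRU on capacity | ✓ | ✓ | ✓ |
| LRU re-insert overwrites + promotes | ✓ | ✓ | ✓ |
| LRU MRU-first key order after Get | ✓ | ✓ | ✓ |
| Merger: empty inputs → empty output | ✓ | ✓ | ✓ |
| Merger: single source passthrough | ✓ | ✓ | ✓ |
| Merger: two-source interleave (no duplicates) | ✓ | ✓ | ✓ |
| Merger: newest-wins on tie | ✓ | ✓ | ✓ |
| Merger: tombstone kept when drop=false | ✓ | ✓ | ✓ |
| Merger: tombstone dropped when drop=true | ✓ | ✓ | ✓ |
| SerializeStream deterministic & expected size | ✓ | ✓ | ✓ |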

How to re-verify locally

cd db-08-block-cache-and-iterators
bash scripts/verify.sh         # unit tests for all three languages
bash scripts/cross_test.sh     # cross-language byte-identity test

What would invalidate this proof

  • Changing SerializeStream's framing (lengths, endianness, type-byte encoding) — sha256 would diverge immediately.
  • Changing the (key, src) heap comparator to break ties on src descending: the newest-wins test fails before the cross-test runs.
  • Changing the cache capacity unit from entries to bytes — the LRU tests would need recalibration but no other lab depends on the unit choice.

db-08 — Broader Ideas

What this lab teaches that goes beyond storage

The two primitives in this lab — bounded caches and tournament merges — are load-bearing in every layer of computing, not just databases.

Bounded caches

  • CPU caches (L1/L2/L3) implement set-associative LRU/PLRU in hardware with the same hash-map-plus-recency-order shape, just expressed in gates.
  • TLBs are caches over the page tables' virtual-to-physical mapping; they share LRU's vulnerability to large scans.
  • HTTP caches (CDN edges, browser caches) cache responses keyed on URL with the same eviction problem and many of the same policies (LRU, LFU, TinyLFU, S3-FIFO).
  • Compiler caches (ccache, sccache, Bazel's CAS) cache build outputs keyed on the content hash of inputs — same data structure, different key.
  • JIT method caches in V8 and HotSpot cache compiled code; they too evict on capacity pressure.

The pattern is universal: bounded random-access store + recency or frequency order. Once you can implement and analyze LRU, you can swap in LFU, ARC, LRU-K, 2Q, CLOCK-Pro, TinyLFU, S3-FIFO, or W-TinyLFU by replacing the order without changing the index.

K-way merges

  • External sort (sort -m, MapReduce shuffle, Spark's sort-shuffle) is literally a K-way merge of sorted runs, identical in structure to ours.
  • Stream-stream joins (Flink, ksqlDB, Materialize) merge two ordered streams by key with a sliding-window predicate.
  • Time-series databases (Prometheus, InfluxDB, VictoriaMetrics) merge sorted chunks across files, then deduplicate by timestamp — newest-wins, with timestamp as the tie-breaker instead of source index.
  • Git's pack-objects merges sorted delta chains across pack files when serving a fetch.
  • Snapshot iteration in MVCC databases is a merging iterator with a per-key filter that drops versions newer than the snapshot's commit timestamp — exactly what db-13 will build on top of db-08.

"When does this break?"

  • LRU + scans. A long sequential scan pollutes the cache with entries the workload will never see again. Mitigations: scan-resistant policies (LRU-K, ARC), separate cache pools per access pattern, or O_DIRECT bypass.
  • K-way merge with very large K. When K approaches thousands (e.g., a Cassandra node with many SSTables on disk), O(log K) per-entry cost starts to bite. The fix is not a better merger but a compaction policy that keeps K bounded (db-07's job).
  • Tombstones outliving the keys they shadow. A delete-heavy workload produces tombstones faster than compaction can drop them; the merger spends increasing CPU skipping shadowed entries. Cassandra calls this "tombstone hell" and ships a tombstone_warn_threshold.
  • Cache stampede. Many threads simultaneously missing on the same key hammer the underlying storage; production systems add per-key locks ("singleflight" in Go's groupcache).

Extensions worth attempting

  1. Sharded LRU. Replace the single cache with N independent shards keyed on hash(file_id, offset) % N.
  2. TinyLFU admission filter in front of the cache (a frequency sketch admits a new entry only if its estimated frequency exceeds that of the entry it would evict).
  3. Block-cache statistics beyond hits/misses/evicts: per-entry size, bytes resident, age histogram, top-N hot blocks.
  4. Bidirectional iterator. Add Prev() to support reverse range scans.
  5. Range-tombstone aware merger. Adding range deletes changes heap-pop semantics: a range tomb shadows a range of point keys.
  6. O(1) amortized doubly-linked list arena (Rust) that interns BlockKey to u32 indices to halve hash map memory.

Where this lab fits in the curriculum

After db-08, every later lab gets a free ride on these primitives:

  • db-09 wires BlockCache and MergingIterator into the LevelDB-complete Get/Scan paths.
  • db-13 (MVCC) layers snapshot-visibility filtering on top of a merging iterator just like this one.
  • db-14 (indexes / query optimization) builds secondary merging iterators for index-scan-then-fetch plans.
  • db-20 (distributed KV store) shards block caches across nodes and adds a network-aware admission policy.

Step 01 — LRU Block Cache

Goal

Implement a bounded, O(1) LRU cache keyed on (file_id, block_offset) holding decoded block bytes, with hit/miss/eviction statistics.

Spec

API (Rust signature; the Go and C++ APIs mirror it):

pub struct BlockKey { pub file_id: u64, pub offset: u64 }
pub struct CacheStats { pub hits: u64, pub misses: u64, pub evictions: u64 }

// Signatures only; the bodies are the exercise.
impl BlockCache {
    pub fn new(capacity: usize) -> Self;                       // capacity > 0
    pub fn get(&mut self, k: &BlockKey) -> Option<Vec<u8>>;    // promotes to MRU on hit
    pub fn insert(&mut self, k: BlockKey, v: Vec<u8>) -> bool; // returns true on overwrite
    pub fn len(&self) -> usize;
    pub fn capacity(&self) -> usize;
    pub fn stats(&self) -> CacheStats;
    pub fn keys_mru_to_lru(&self) -> Vec<BlockKey>;            // test-only
}

Behavior contracts:

  • get returns Some(v.clone()) on hit and moves that entry to MRU; bumps hits counter.
  • get returns None on miss; bumps misses counter.
  • insert on an existing key overwrites the value and promotes to MRU.
  • insert on a full cache evicts the LRU entry first; bumps evictions.
  • keys_mru_to_lru() returns the live keys in order; used by tests only.
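One way to satisfy these contracts is sketched below, assuming an index-linked arena similar in spirit to the one described for the Rust layout. It is a simplified illustration, not the lab's implementation: the names LruSketch and NIL are hypothetical, counters are plain public fields instead of a CacheStats struct, and there is no free-list because slots are only ever reused on eviction.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct BlockKey { pub file_id: u64, pub offset: u64 }

const NIL: usize = usize::MAX; // sentinel for "no neighbour"

struct Node { key: BlockKey, val: Vec<u8>, prev: usize, next: usize }

pub struct LruSketch {
    cap: usize,
    map: HashMap<BlockKey, usize>, // key -> slot index in `nodes`
    nodes: Vec<Node>,              // arena; a slot is reused when its entry is evicted
    head: usize,                   // MRU slot
    tail: usize,                   // LRU slot
    pub hits: u64,
    pub misses: u64,
    pub evictions: u64,
}

impl LruSketch {
    pub fn new(cap: usize) -> Self {
        assert!(cap > 0);
        LruSketch { cap, map: HashMap::new(), nodes: Vec::new(),
                    head: NIL, tail: NIL, hits: 0, misses: 0, evictions: 0 }
    }

    // Detach slot `i` from the recency list in O(1).
    fn unlink(&mut self, i: usize) {
        let (p, n) = (self.nodes[i].prev, self.nodes[i].next);
        if p == NIL { self.head = n } else { self.nodes[p].next = n }
        if n == NIL { self.tail = p } else { self.nodes[n].prev = p }
    }

    // Make slot `i` the MRU entry in O(1).
    fn push_front(&mut self, i: usize) {
        self.nodes[i].prev = NIL;
        self.nodes[i].next = self.head;
        if self.head == NIL { self.tail = i } else { self.nodes[self.head].prev = i }
        self.head = i;
    }

    pub fn get(&mut self, k: &BlockKey) -> Option<Vec<u8>> {
        match self.map.get(k).copied() {
            Some(i) => {
                self.hits += 1;
                self.unlink(i);
                self.push_front(i);
                Some(self.nodes[i].val.clone())
            }
            None => { self.misses += 1; None }
        }
    }

    /// Returns true when an existing key was overwritten.
    pub fn insert(&mut self, k: BlockKey, v: Vec<u8>) -> bool {
        if let Some(i) = self.map.get(&k).copied() {
            self.nodes[i].val = v;
            self.unlink(i);
            self.push_front(i);
            return true;
        }
        let slot = if self.nodes.len() < self.cap {
            self.nodes.push(Node { key: k, val: v, prev: NIL, next: NIL });
            self.nodes.len() - 1
        } else {
            // Full: evict the LRU entry and reuse its slot.
            let victim = self.tail;
            self.unlink(victim);
            self.map.remove(&self.nodes[victim].key);
            self.nodes[victim].key = k;
            self.nodes[victim].val = v;
            self.evictions += 1;
            victim
        };
        self.map.insert(k, slot);
        self.push_front(slot);
        false
    }

    pub fn keys_mru_to_lru(&self) -> Vec<BlockKey> {
        let mut out = Vec::new();
        let mut i = self.head;
        while i != NIL { out.push(self.nodes[i].key); i = self.nodes[i].next; }
        out
    }
}
```

The hash map answers "where is this key", the prev/next indices answer "who is stalest", and neither alone gives both operations in O(1).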

Acceptance

(cd src/rust && cargo test)
(cd src/go   && go test ./...)
(cd src/cpp  && cmake -B build && cmake --build build && ctest --test-dir build)

All four LRU tests pass in each language:

  • lru_basic_hit_miss
  • lru_evicts_lru_on_capacity
  • lru_reinsert_overwrites_and_promotes
  • lru_keys_order_mru_first

Discussion prompts

  • Why does get need a &mut self and not just &self? (Because it mutates the recency order, even though it only "reads" the cached value.)
  • What changes if you bound by total bytes instead of entries? (You need to weigh each entry on insert and evict in a loop until under cap.)
  • How would you make this thread-safe with minimum contention? (Sharded by hash(key) % N, one mutex per shard.)
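The sharding answer in the last prompt can be sketched as follows. Only the routing and per-shard locking are shown; each shard here is a plain HashMap standing in for a per-shard LRU. ShardedStore and shard_for are hypothetical names, and the use of std's DefaultHasher is an assumption made for the illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

/// Hypothetical sharded store keyed on (file_id, offset).
pub struct ShardedStore {
    shards: Vec<Mutex<HashMap<(u64, u64), Vec<u8>>>>,
}

impl ShardedStore {
    pub fn new(num_shards: usize) -> Self {
        assert!(num_shards > 0);
        ShardedStore {
            shards: (0..num_shards).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    /// Routes a key to a shard: hash(key) % N.
    pub fn shard_for(&self, key: (u64, u64)) -> usize {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() % self.shards.len() as u64) as usize
    }

    pub fn insert(&self, key: (u64, u64), val: Vec<u8>) {
        // Only the owning shard is locked; other shards stay available.
        let shard = &self.shards[self.shard_for(key)];
        shard.lock().unwrap().insert(key, val);
    }

    pub fn get(&self, key: (u64, u64)) -> Option<Vec<u8>> {
        let shard = &self.shards[self.shard_for(key)];
        let found = shard.lock().unwrap().get(&key).cloned();
        found
    }
}
```

Because both methods take &self, the store can be shared across threads behind an Arc, and two threads contend only when their keys land in the same shard.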

Step 02 — Merging Iterator

Goal

Implement a streaming K-way merging iterator over K pre-sorted (key, Entry) sources, where source index 0 is newest and ties are won by smaller index. Support an optional drop_tombstones flag.

Spec

API (Rust signature):

pub struct MergingIterator { /* … */ }

// Signatures only; the bodies are the exercise.
impl MergingIterator {
    pub fn new(sources: Vec<Vec<(Vec<u8>, Entry)>>, drop_tombstones: bool) -> Self;
}

impl Iterator for MergingIterator {
    type Item = (Vec<u8>, Entry);
    fn next(&mut self) -> Option<Self::Item>;
}

Behavior contracts:

  • Each source is sorted strictly ascending by key with no within-source duplicates (caller's responsibility).
  • The merged stream is sorted strictly ascending; each key appears at most once.
  • On tie, source with smaller index wins; older sources are drained — i.e. their copies of that key are advanced past, not yielded.
  • If drop_tombstones=true, winning entries whose type is Tombstone are not yielded; the iterator continues to the next key.
  • The working set is O(K) regardless of N.
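The contracts above can be sketched with a min-heap of (key, source index) pairs. This is a minimal illustration under stated assumptions, not the lab's implementation: Entry is a stand-in for db-06's type, MergeSketch is a hypothetical name, and each source is assumed to be strictly ascending with no internal duplicates.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Stand-in for the lab's Entry type (assumed shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Entry { Value(Vec<u8>), Tombstone }

pub struct MergeSketch {
    sources: Vec<std::vec::IntoIter<(Vec<u8>, Entry)>>,
    fronts: Vec<Option<Entry>>, // entry for the key each source has in the heap
    heap: BinaryHeap<Reverse<(Vec<u8>, usize)>>, // at most one item per source
    drop_tombstones: bool,
}

impl MergeSketch {
    pub fn new(sources: Vec<Vec<(Vec<u8>, Entry)>>, drop_tombstones: bool) -> Self {
        let mut it = MergeSketch {
            fronts: vec![None; sources.len()],
            sources: sources.into_iter().map(|s| s.into_iter()).collect(),
            heap: BinaryHeap::new(),
            drop_tombstones,
        };
        for src in 0..it.sources.len() {
            it.advance(src);
        }
        it
    }

    // Pull the next entry of `src` (if any) and register its key in the heap.
    fn advance(&mut self, src: usize) {
        if let Some((key, entry)) = self.sources[src].next() {
            self.fronts[src] = Some(entry);
            self.heap.push(Reverse((key, src)));
        }
    }
}

impl Iterator for MergeSketch {
    type Item = (Vec<u8>, Entry);

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            // Smallest key wins; on equal keys the smaller (newer) index wins.
            let Reverse((key, src)) = self.heap.pop()?;
            let entry = self.fronts[src].take().expect("heap item has a front entry");
            self.advance(src);
            // Drain older copies of the same key without yielding them.
            loop {
                let dup = match self.heap.peek() {
                    Some(Reverse((k, s))) if *k == key => *s,
                    _ => break,
                };
                self.heap.pop();
                self.fronts[dup] = None;
                self.advance(dup);
            }
            if self.drop_tombstones && entry == Entry::Tombstone {
                continue; // skip the deleted key and move on to the next one
            }
            return Some((key, entry));
        }
    }
}
```

The heap never holds more than one item per source, which is where the O(K) working-set bound comes from. Once the heap is empty, every further call returns None.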

Canonical serialization (cross-test contract)

For each yielded (key, entry):

u32_le(len(key)) || key                                          // 4+|key| bytes
u8(type)                                                          // 1 byte; 0=Value, 1=Tombstone
if type == Value:
    u32_le(len(value)) || value                                  // 4+|value| bytes

This is what SerializeStream emits and what the cross-test sha256s.
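A minimal sketch of that framing, assuming a stand-in Entry enum and a free function named serialize_stream (both hypothetical; the lab's SerializeStream may be shaped differently):

```rust
/// Stand-in for the lab's Entry type (assumed shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Entry { Value(Vec<u8>), Tombstone }

/// Serializes (key, entry) pairs using the canonical framing:
/// u32 LE key length, key bytes, u8 type, and for values a u32 LE length plus bytes.
pub fn serialize_stream<I>(items: I) -> Vec<u8>
where
    I: IntoIterator<Item = (Vec<u8>, Entry)>,
{
    let mut out = Vec::new();
    for (key, entry) in items {
        out.extend_from_slice(&(key.len() as u32).to_le_bytes());
        out.extend_from_slice(&key);
        match entry {
            Entry::Value(v) => {
                out.push(0); // type 0 = Value
                out.extend_from_slice(&(v.len() as u32).to_le_bytes());
                out.extend_from_slice(&v);
            }
            Entry::Tombstone => out.push(1), // type 1 = Tombstone, no payload
        }
    }
    out
}
```

Serializing a lone tombstone for key5 produces the nine bytes 04 00 00 00 6b 65 79 35 01, which is the framing the cross-test greps for.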

Acceptance

(cd src/rust && cargo test)
(cd src/go   && go test ./...)
(cd src/cpp  && ctest --test-dir build)

Six (Rust/Go) or seven (C++) merger tests must pass:

  • empty inputs → empty output
  • single source passthrough
  • two-source interleave with no duplicates
  • newest-wins on tie
  • tombstone kept when drop=false
  • tombstone dropped when drop=true
  • (SerializeStream deterministic & expected size)

Discussion prompts

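  • Why not nested two-way merges? (A left-deep chain of two-way merges re-compares early entries at every stage, so total work is O(N · K) instead of O(N log K); for K=10 the ratio of the two bounds is K / log2 K ≈ 3×, and it grows with K.)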
  • Why is "drain duplicates" eager rather than lazy? (Lazy would force the caller to dedupe, breaking the invariant that the merger's output is the single source of truth for "what's live at this key".)
  • Where in real systems do you find tie-break-by-source-index? (LSM read path, time-series chunk merging, Kafka log compaction, anywhere "newer wins" without explicit timestamps.)

Step 03 — CLI and Cross-Language Test

Goal

Wrap the MergingIterator in a CLI binary (merge_iter) that reads N SSTables (built by db-06), runs a merge, and writes the canonical serialized stream to stdout. Then prove the three language implementations are byte-identical with scripts/cross_test.sh.

CLI spec

merge_iter [--drop-tombstones] IN1.sst IN2.sst ...
  • IN1 is the newest source; INk is the oldest.
  • Reads each input via db-06's SstReader, converts to Vec<(Vec<u8>, Entry)>, feeds all into a MergingIterator, calls SerializeStream, and writes the bytes verbatim to stdout.
  • Exit code: 0 success, 1 input error, 2 usage error.
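The argument handling can be sketched as a pure function that is easy to unit-test. The names CliArgs, parse_args, and EXIT_USAGE are hypothetical, and this sketch accepts the flag in any position, which is slightly more permissive than the usage line above.

```rust
#[derive(Debug, PartialEq, Eq)]
pub struct CliArgs {
    pub drop_tombstones: bool,
    pub inputs: Vec<String>, // newest source first
}

pub const EXIT_USAGE: i32 = 2;

/// Parses everything after argv[0]; Err carries the process exit code.
pub fn parse_args(args: &[String]) -> Result<CliArgs, i32> {
    let mut drop_tombstones = false;
    let mut inputs = Vec::new();
    for a in args {
        if a.as_str() == "--drop-tombstones" {
            drop_tombstones = true;
        } else if a.starts_with("--") {
            return Err(EXIT_USAGE); // unknown flag
        } else {
            inputs.push(a.clone());
        }
    }
    if inputs.is_empty() {
        return Err(EXIT_USAGE); // at least one SSTable is required
    }
    Ok(CliArgs { drop_tombstones, inputs })
}
```

A real main would map Err(code) to std::process::exit(code) and reserve exit code 1 for failures while opening or reading an input file.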

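Implementations (paths as listed in the per-language layout):

  • Rust: src/rust/src/bin/merge_iter.rs
  • Go: src/go/cmd/merge_iter/main.go
  • C++: src/cpp/src/merge_iter_bin.cc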

Acceptance

Run the cross-test:

bash scripts/cross_test.sh

It must:

  1. Print match: lines with sha256s that are the same for all three languages (in both drop=false and drop=true modes).
  2. Confirm via hex spot-check that NEW-10, OLD-50, and val99 are present in the stream.
  3. Confirm the key5 tombstone framing (040000006b65793501) appears with drop=false and is absent with drop=true.
  4. End with CROSS-TEST OK.

Captured truth (current run):

drop=false  → f693c483ef39dfef8e6285e29f9051a57e60bf2c4ba7b45bbf552c7932687fd1 (1874 bytes)
drop=true   → ec71c56c89f451d33e58697af2d7bce985069078e1c599cc42062dfbba6e250e (1865 bytes)

Discussion prompts

  • Why pipe the binary stream into sha256sum rather than diff the entry list? (A bytewise hash catches every serialization difference with a single number; short of a hash collision, equal digests mean byte-identical streams, which is a stricter check than comparing decoded entry lists.)
  • The drop=true output is exactly 9 bytes shorter than drop=false. Where do those 9 bytes go? (u32_le(4) + "key5" + u8(1) = 4+4+1 = 9 — one tombstone frame.)
  • If you wanted to add a new entry kind (say, a "merge-add" delta), what would you change in the serialization? (Pick a new type byte (e.g. 2), decide its payload framing, document it, and update all three languages' SerializeStream and CLI in lockstep.)

db-09 — LevelDB Complete

What is it?

A small but end-to-end LSM-tree key-value store assembled from the labs we have built so far. It is the smallest interesting "real database" we can ship: opens a directory, durably accepts put/delete/batch writes, answers get/scan queries, and survives crashes — using the WAL (db-03), MemTable (db-05), SSTable format (db-06), and merging iterator (db-08) as its parts.

The engine deliberately stops short of automatic background compaction. That arrives in db-21; here we keep the focus on correctness of the read path across multiple immutable L0 SSTables and a live memtable.

Why does it matter?

This is the first lab where the labels start to look like the things you actually run in production:

  • Db::open(dir) + MANIFEST — every LSM-shaped store (LevelDB, RocksDB, Pebble, Cassandra's SSTable subsystem, HBase HFile) has exactly this contract: a directory is the database, a manifest enumerates which files are live, and recovery rebuilds in-memory state by reading the manifest and replaying the WAL.
  • The write path's three steps (encode batch → append+fsync WAL → apply to memtable) are the standard LSM commit sequence. Most WAL-based storage engines do these three things in this order, though many make the fsync configurable. Once you internalize why (durability before visibility), you can read any LSM source tree.
  • The read path — memtable then newest SSTable then older SSTables, with the first hit (Value or Tombstone) winning — is the core LSM invariant. Compaction in db-21 is just amortizing this work; it doesn't change the rule.

If you understand this lab, you understand the shape of LevelDB.

How does it work?

                 ┌────── Db (one directory) ─────────────────────┐
                 │                                                │
   write path    │  WriteBatch ─► encode ─► WAL.append + fsync   │
   ───────────►  │                            │                  │
                 │                            ▼                  │
                 │                      MemTable (in RAM)        │
                 │                            │                  │
                 │                  size/explicit Flush          │
                 │                            ▼                  │
                 │   sst-NNNNNN.sst.tmp ─► fsync ─► rename       │
                 │                            │                  │
                 │            prepend (id, SstReader) to L0      │
                 │            rewrite MANIFEST atomically        │
                 │            close+delete+reopen WAL            │
                 │                                                │
   read path     │  Get(k):  MemTable → L0[0] → L0[1] → …        │
   ───────────►  │           first hit wins; Tombstone ⇒ None    │
                 │                                                │
                 │  Scan:    MergingIterator over                 │
                 │             [MemTable, L0[0], L0[1], …]       │
                 │             drop_tombstones=true               │
                 └────────────────────────────────────────────────┘

On-disk layout

<dir>/
  MANIFEST            text; one "L0 <id>" line per live SSTable, newest first
  wal.log             db-03 WAL of WriteBatch records (binary)
  sst-000001.sst      db-06 SSTable, one per flush, zero-padded 6-digit id
  sst-000002.sst
  ...

Recovery (Db::open)

  1. mkdir -p the directory.
  2. Read MANIFEST line by line; each line is L0 <id> newest-first.
  3. Open each SSTable in that order; track max_id.
  4. Open the WAL with WalIter and replay every record (WriteBatch::decode then apply to a fresh memtable). Any torn tail is dropped by the WAL iterator (db-03 invariant).
  5. Open the WAL for writes; set next_id = max_id + 1.
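Steps 2, 3, and 5 can be sketched as three small helpers. The function names are hypothetical, and two behaviours are assumptions of this sketch rather than documented rules: blank lines are skipped, and an empty MANIFEST yields next_id = 1.

```rust
/// Parses MANIFEST text: one "L0 <id>" line per live SSTable, newest first.
pub fn parse_manifest(text: &str) -> Result<Vec<u64>, String> {
    let mut ids = Vec::new();
    for (lineno, line) in text.lines().enumerate() {
        let line = line.trim();
        if line.is_empty() {
            continue; // assumption: blank lines are tolerated
        }
        let id = line
            .strip_prefix("L0 ")
            .ok_or_else(|| format!("line {}: expected \"L0 <id>\"", lineno + 1))?
            .trim()
            .parse::<u64>()
            .map_err(|e| format!("line {}: {}", lineno + 1, e))?;
        ids.push(id);
    }
    Ok(ids)
}

/// File name for an SSTable id, zero-padded to six digits.
pub fn sst_file_name(id: u64) -> String {
    format!("sst-{:06}.sst", id)
}

/// next_id = max_id + 1, or 1 when no SSTable exists yet (assumption).
pub fn next_id(ids: &[u64]) -> u64 {
    ids.iter().copied().max().unwrap_or(0) + 1
}
```

The returned id order is the read order: the first id is the newest SSTable and must be consulted first.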

Write path

put(k, v) ≡ Write(WriteBatch{Put{k,v}})
del(k)    ≡ Write(WriteBatch{Delete{k}})

Write(batch):
    bytes  = batch.encode()
    wal.append(bytes); wal.sync()    # durability first
    apply(batch, memtable)           # then visibility

The batch wire format is identical in-memory and on-WAL:

u32 LE count
  for each op:
    u8 type          # 0 = Put, 1 = Delete
    u32 LE klen
    key bytes
    if Put:
      u32 LE vlen
      value bytes

This is the same encoder/decoder Rust, Go, and C++ all use, which is what makes the cross-language byte-identity test possible.
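A minimal sketch of an encoder and decoder for that wire format follows. Op, encode_batch, and decode_batch are hypothetical names, not the lab's WriteBatch API, and the decoder's rejection of trailing bytes is an assumption of this sketch.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Op {
    Put(Vec<u8>, Vec<u8>),
    Delete(Vec<u8>),
}

fn put_bytes(out: &mut Vec<u8>, b: &[u8]) {
    out.extend_from_slice(&(b.len() as u32).to_le_bytes());
    out.extend_from_slice(b);
}

pub fn encode_batch(ops: &[Op]) -> Vec<u8> {
    let mut out = (ops.len() as u32).to_le_bytes().to_vec();
    for op in ops {
        match op {
            Op::Put(k, v) => {
                out.push(0); // type 0 = Put
                put_bytes(&mut out, k);
                put_bytes(&mut out, v);
            }
            Op::Delete(k) => {
                out.push(1); // type 1 = Delete
                put_bytes(&mut out, k);
            }
        }
    }
    out
}

// Splits `n` bytes off the front of the cursor, or fails on truncated input.
fn take<'a>(buf: &mut &'a [u8], n: usize) -> Option<&'a [u8]> {
    let cur: &'a [u8] = *buf;
    if cur.len() < n {
        return None;
    }
    let (head, tail) = cur.split_at(n);
    *buf = tail;
    Some(head)
}

fn take_u32(buf: &mut &[u8]) -> Option<u32> {
    let b = take(buf, 4)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

fn take_bytes(buf: &mut &[u8]) -> Option<Vec<u8>> {
    let n = take_u32(buf)? as usize;
    take(buf, n).map(|b| b.to_vec())
}

/// Returns None on truncated input, an unknown op type, or trailing bytes.
pub fn decode_batch(mut buf: &[u8]) -> Option<Vec<Op>> {
    let count = take_u32(&mut buf)?;
    let mut ops = Vec::new();
    for _ in 0..count {
        let op = match take(&mut buf, 1)?[0] {
            0 => {
                let k = take_bytes(&mut buf)?;
                let v = take_bytes(&mut buf)?;
                Op::Put(k, v)
            }
            1 => Op::Delete(take_bytes(&mut buf)?),
            _ => return None,
        };
        ops.push(op);
    }
    if buf.is_empty() { Some(ops) } else { None }
}
```

Because decoding fails cleanly on a short buffer, a WAL record cut off by a crash is rejected rather than half-applied.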

Flush

Flush():
    if memtable empty: return
    id = next_id++
    build SstWriter from memtable.sorted()
    write sst-id.sst.tmp; fsync; rename → sst-id.sst   # crash-safe publish
    prepend (id, SstReader) to ssts                   # newest first
    rewrite MANIFEST atomically (tmp + rename)
    wal.close(); remove(wal.log); wal = Wal::open(wal.log)
    memtable = MemTable::new()

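The order matters: the SST must be durably renamed before we rewrite MANIFEST, and MANIFEST must be durably renamed before we truncate the WAL. If we crash between any two steps, recovery is safe: either the WAL still has the records, or the SST is on disk and listed in MANIFEST, or both. In the "both" case, WAL replay re-applies writes that are already in the newest SST, which is harmless because the replayed memtable entries shadow identical values.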
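The tmp + fsync + rename publish used for both the SST and MANIFEST can be sketched with the standard library alone. The helper name publish_atomically is hypothetical, and the directory fsync is attempted only on Unix-like systems, where opening a directory as a file is permitted.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Writes `bytes` to `<dir>/<name>.tmp`, fsyncs it, then renames it to `<dir>/<name>`.
pub fn publish_atomically(dir: &Path, name: &str, bytes: &[u8]) -> io::Result<()> {
    let tmp = dir.join(format!("{}.tmp", name));
    let dst = dir.join(name);

    let mut f = File::create(&tmp)?;
    f.write_all(bytes)?;
    f.sync_all()?; // contents are durable before the file becomes visible
    drop(f);

    fs::rename(&tmp, &dst)?; // readers see either the old file or the new one

    if cfg!(unix) {
        // fsync the directory so the rename itself survives a crash.
        File::open(dir)?.sync_all()?;
    }
    Ok(())
}
```

Calling it twice with the same name replaces the previous contents in one step, which is the behaviour the MANIFEST rewrite relies on.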

Read path

Get(k) walks the in-RAM memtable first, then SSTables newest-first. The first hit wins:

Source hit returns    Result
Value(v)              Some(v)
Tombstone             None
miss                  continue

If nothing matches, return None.
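
The precedence rule is small enough to model directly. In this sketch a BTreeMap stands in for both the memtable and an SSTable reader; Entry, Source, and get are illustrative names, not the lab's types.

```rust
use std::collections::BTreeMap;

/// One stored entry: a live value, or a tombstone left by a delete.
#[derive(Debug, Clone, PartialEq)]
enum Entry {
    Value(Vec<u8>),
    Tombstone,
}

/// A sorted source; a stand-in for the memtable and for each SSTable.
type Source = BTreeMap<Vec<u8>, Entry>;

/// Look in the memtable first, then in SSTables newest first.
/// The first source that knows the key decides the answer.
fn get(memtable: &Source, ssts_newest_first: &[Source], key: &[u8]) -> Option<Vec<u8>> {
    let sources = std::iter::once(memtable).chain(ssts_newest_first.iter());
    for source in sources {
        match source.get(key) {
            Some(Entry::Value(v)) => return Some(v.clone()),
            Some(Entry::Tombstone) => return None, // a tombstone counts as a hit
            None => continue,                      // miss: try the next older source
        }
    }
    None
}
```

Returning None on a tombstone, instead of continuing, is what stops an older SSTable from resurrecting a deleted key.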

ScanAll() and SerializeView() reuse db-08's MergingIterator. The memtable is materialized into a KeyEntry vector (already sorted by BTreeMap or std::map), then the iterator merges it with each SSTable's entries, preferring the newer source on ties (memtable beats L0[0] beats L0[1] ...).
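
The tie-break can be illustrated with a small heap-based k-way merge. This is a sketch of the idea only, not db-08's MergingIterator: merge_newest_wins and Entry are hypothetical names, and each source is assumed to be sorted with unique keys.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

#[derive(Debug, Clone, PartialEq)]
enum Entry {
    Value(Vec<u8>),
    Tombstone,
}

/// Merge sorted sources listed newest first (index 0 = memtable).
/// For duplicate keys the entry from the lowest source index wins.
fn merge_newest_wins(
    sources: &[Vec<(Vec<u8>, Entry)>],
    drop_tombstones: bool,
) -> Vec<(Vec<u8>, Entry)> {
    // Heap items are (key, source index, position). Reverse turns the max-heap
    // into a min-heap, so equal keys pop newest source first.
    let mut heap = BinaryHeap::new();
    for (si, src) in sources.iter().enumerate() {
        if let Some((k, _)) = src.first() {
            heap.push(Reverse((k.clone(), si, 0usize)));
        }
    }
    let mut out = Vec::new();
    let mut last_key: Option<Vec<u8>> = None;
    while let Some(Reverse((key, si, pos))) = heap.pop() {
        // Refill the heap from the source we just consumed.
        if let Some((next_key, _)) = sources[si].get(pos + 1) {
            heap.push(Reverse((next_key.clone(), si, pos + 1)));
        }
        if last_key.as_ref() == Some(&key) {
            continue; // an older source repeating a key that is already decided
        }
        let entry = sources[si][pos].1.clone();
        if !(drop_tombstones && entry == Entry::Tombstone) {
            out.push((key.clone(), entry));
        }
        last_key = Some(key);
    }
    out
}
```

A tombstone still records its key in last_key before being dropped, so an older value for the same key cannot leak into the output.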

What's intentionally out of scope

  • Compaction — db-21. Without it, repeated overwrites of the same key accumulate as more L0 SSTables. Reads stay correct (newest wins) but per-Get work grows linearly in flush count.
  • Snapshots / MVCC — db-13.
  • Sharding, replication, sequence numbers — db-16+.
  • Bloom filters per SSTable — built in db-04; wiring is a db-21 optimization (skip SSTables whose Bloom rejects the key).

Cross-language invariant

All three implementations expose a dbctl --dir DIR CLI that reads a script from stdin (PUT k v, DEL k, FLUSH, DUMP, DUMP_WITH_TOMBS). scripts/cross_test.sh drives the same script through each, performs a crash/recover cycle by closing and reopening the database, then compares sha256(DUMP) and sha256(DUMP_WITH_TOMBS) across Rust, Go, and C++.

A byte-identical DUMP after recovery proves that all three implementations agree on: the WAL record format, the SSTable format, the MANIFEST format, the merge ordering, the tombstone semantics, and the recovery procedure.

db-09 — References

Primary sources

Production engines that share this shape

Read-path correctness

  • Mark Callaghan, Read, write & space amplification, 2018 — explains why the "newest source wins" rule is required and how compaction trades read amplification for write amplification. https://smalldatum.blogspot.com/2018/09/read-write-and-space-amplification.html
  • Pebble's docs/rocksdb.md: a diff-style walkthrough of how a modern engine differs from LevelDB while preserving the same correctness invariants.

Crash safety

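  • Pillai, Chidambaram, et al., All File Systems Are Not Created Equal, OSDI '14. The "fsync the file, then fsync the directory" rule comes from this work. db-09 applies the first half (fsync the file before the rename) to SST publish and MANIFEST rewrite; fsyncing the parent directory after each create or rename is listed under production hardening in Broader Ideas.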

Cross-lab dependencies

  • db-03 (WAL) — record framing, torn-tail tolerance, WalIter.
  • db-05 (MemTable) — sorted map with explicit tombstones.
  • db-06 (SSTable) — on-disk sorted-string file format with a footer and trailing checksum.
  • db-08 (BlockCache + MergingIterator) — k-way merge with newer-source-wins and optional tombstone dropping; canonical SerializeStream used by Db::serialize_view.

db-09 — Analysis

We are stitching together db-03/05/06/08 into the smallest engine that deserves the name database. The hard part is not any single component — we already have all of them — but choosing the smallest set of design decisions that yields crash safety and cross-language byte-identity.

Required invariants

  1. Durability of put — once put returns, a crash must not lose the write. Achieved by WAL append + fsync before applying to the memtable.
  2. Atomic publish of an SSTable — a recovering process must see either the complete SST or none of it. Achieved by write(.tmp) → fsync → rename. (POSIX specifies rename as an atomic replacement of the target name as seen by other processes; whether that atomicity also holds across a crash depends on the file system, and the rename is only guaranteed durable once the parent directory has been fsynced.)
  3. Atomic publish of a flush — a recovering process must not load an SST that MANIFEST does not list, and must not see MANIFEST listing an SST that does not exist. Achieved by ordering: write SST → rename SST → rewrite MANIFEST → rename MANIFEST → truncate WAL. A crash between SST-publish and MANIFEST-rewrite leaves an unlisted SST file on disk. It is harmless: recovery opens only the ids MANIFEST lists, and because next_id = max_id + 1 is computed from those ids, the next flush allocates the same id and its rename replaces the orphan. A crash between MANIFEST-rewrite and WAL-truncate replays records that are already in the SST. MemTable::put is idempotent for the same key, so the replayed memtable simply shadows identical entries in the SST; the next flush writes them into a newer SST and empties the WAL.
  4. Read precedence — for any key k, the answer must come from the most recent writer. Order: memtable first, then SSTables in newest-to-oldest order. Tombstones count as a hit.
  5. Cross-language determinism — given the same input script, all three languages must produce byte-identical DUMP. Achieved by sharing exactly the formats defined in db-03/05/06/08 plus the WriteBatch wire format defined in this lab.

Design decisions

Why MANIFEST is plain text

LevelDB's MANIFEST is a binary record log of edits ("add file X to level Y", "delete file X", "set next file number to N", ...). That makes log replay fast but is not byte-identity-friendly across languages because each edit record carries varint-encoded fields and an internal version-edit format.

For this lab the live set is small (one process, no concurrent writers, no compaction) so we use the simpler representation: a text file rewritten on every flush, atomic by tmp+rename. The cost is one extra O(n) write per flush where n = number of live SSTables. For our small in-process loads, this is invisible. db-21 will replace it with the LevelDB-style edit log when compaction needs incremental atomic edits.

Why one SSTable per flush

LevelDB also writes one SST per flush; that's why they're "L0" files (level 0 is the only level where files may overlap). We keep the same property. "Newest L0 wins" then degenerates from a level-aware rule to a simple position-in-MANIFEST rule.

Why no compaction in db-09

Compaction is a separate concern: it's a background process whose only job is to reduce read amplification and reclaim space. Skipping it means:

  • Read cost grows linearly with flush count for keys that miss everything.
  • Disk usage grows monotonically — overwrites and deletes are never reclaimed.

Neither breaks correctness, and both are exactly what db-21 will fix. Splitting them keeps each lab small enough to fully verify.

Why the WriteBatch wire format is reused as the WAL record

Two formats are strictly worse than one: more surface area, more chances for a Rust/Go/C++ encoder to diverge. The batch encoder is the WAL serializer. The WAL framing (record-length + CRC32) is db-03's concern; the payload of each record is a single encoded batch.

Why three languages

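The same reason as every lab from db-01 onward: the most direct way to show that independent implementations of a binary format agree is to compute sha256 of their output and compare. With three independent implementations, the probability that unrelated bugs produce matching sha256s is vanishingly small, so a match line is strong evidence that the encode + flush + recover + read pipeline behaves identically in all three. Agreement alone is not correctness (all three could share one misreading of the spec), which is why the Observation section also checks the dump sizes against the byte format by hand.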

db-09 — Execution

What we built, in the order we built it.

1. Rust (src/rust)

  • Cargo.toml declares crate leveldb09 (lib) and a binary dbctl.
  • path dependencies to db-03-write-ahead-log, db-05-lsm-memtable, db-06-sstable-format, and db-08-block-cache-and-iterators. No network-fetched deps.
  • src/lib.rs defines Db, WriteBatch, Op, OpType, and re-uses the upstream types directly (wal::Wal, memtable::{MemTable, EntryType}, sstable::{SstReader, SstWriter}, blockcache::{MergingIterator, SerializeStream, EntriesFromReader}).
  • 11 inline #[cfg(test)] tests covering: batch round-trip, batch trailing-byte rejection, memtable put/get, delete-shadows-value, flush+memtable cleared, flush+recovery, WAL replay, newest-SST-wins, scan dedupe and tombstone drop, deterministic serialize_view, recovery with both an SST and a non-empty WAL tail.
  • bin/dbctl.rs is a stdin-driven CLI used by the cross-language script.

2. Go (src/go)

  • go.mod module github.com/10xdev/dse/db09 with replace directives pointing at the sibling labs' Go modules.
  • db.go ports the Rust API one-for-one. The WriteBatch wire format is byte-for-byte identical (u32 LE count, then per op: type byte, u32 LE klen, key, optional u32 LE vlen + value).
  • db_test.go mirrors all 11 Rust tests.
  • cmd/dbctl/main.go is the matching CLI.

3. C++ (src/cpp)

  • CMakeLists.txt compiles upstream .cc files directly into local static libraries (wal_lib, memtable_lib, sstable_lib, blockcache_lib). We do not add_subdirectory(../../../db-NN) because that would leak the upstream lab's add_test calls into our ctest.
  • src/db.h and src/db.cc provide db09::Db, constructed via Db::Open(dir) -> std::unique_ptr<Db>. Db is non-copyable and non-movable (its dse::wal::Wal member is itself non-copyable, and exposing moves would require fiddly forwarding for little gain in a one-process toy).
  • WAL move-assignment (wal_ = dse::wal::Wal::Open(path)) is what makes the post-flush WAL reset work; this required confirming the upstream header declares Wal& operator=(Wal&&) noexcept.
  • src/dbctl.cc and tests/test_db09.cc mirror their Rust/Go siblings. The test file uses #undef NDEBUG before <cassert> to guarantee asserts fire under Release builds.

4. Scripts

  • scripts/verify.sh builds and runs each implementation's unit tests.
  • scripts/cross_test.sh:
    1. Builds Rust/Go/C++ dbctl binaries.
    2. Defines one canonical command script (run.script) covering multi-flush, overwrites that land in newer SSTables, tombstones, and a non-empty WAL tail.
    3. For each language: pipes run.script into dbctl --dir db-LANG (writes + close), then reopens the same dir and pipes DUMP and DUMP_WITH_TOMBS into separate files. Reopen forces a real WAL replay and SST reload path.
    4. Computes sha256 of DUMP and DUMP_WITH_TOMBS for each language and asserts all three match.
    5. Spot-checks the Rust DUMP stream hex for the presence of the expected final key-value bytes (b=222, e=5) and the expected tombstone bytes (key a in DUMP_WITH_TOMBS).

What we deliberately didn't build

  • Compaction — db-21.
  • Block cache wiring inside Db — db-08 has the cache; db-09 doesn't need it because each SSTable reader already holds the file bytes in memory. We'll plug in the LRU during db-21 when SSTable I/O becomes cold.
  • Bloom-filter probing — db-04 has bloom; db-21 will skip SSTables whose Bloom rejects the key.

db-09 — Observation

What the cross-language verification actually proves.

Output of scripts/cross_test.sh

=== compare (DUMP, drop_tombstones=true) ===
  DUMP         rust=7d1568c7bfdad9635ff655f7c4162628aa3253a7b95505c3d418362eb4c4c09c (35 B)
  DUMP         go  =7d1568c7bfdad9635ff655f7c4162628aa3253a7b95505c3d418362eb4c4c09c (35 B)
  DUMP         cpp =7d1568c7bfdad9635ff655f7c4162628aa3253a7b95505c3d418362eb4c4c09c (35 B)
  match(DUMP): 7d1568c7bfdad9635ff655f7c4162628aa3253a7b95505c3d418362eb4c4c09c
=== compare (DUMP_WITH_TOMBS) ===
  DUMP_TOMBS   rust=27e3d256e73c3ddbd080ad7a92e5da0be780d65896644eb7d4ec0cc8a574709d (47 B)
  DUMP_TOMBS   go  =27e3d256e73c3ddbd080ad7a92e5da0be780d65896644eb7d4ec0cc8a574709d (47 B)
  DUMP_TOMBS   cpp =27e3d256e73c3ddbd080ad7a92e5da0be780d65896644eb7d4ec0cc8a574709d (47 B)
  match(DUMP_TOMBS): 27e3d256e73c3ddbd080ad7a92e5da0be780d65896644eb7d4ec0cc8a574709d
=== spot-check stream contents ===
  spot-checks ok
=== ALL OK ===

What the canonical script exercises

PUT a 1                 # → memtable
PUT b 2                 #
PUT c 3                 #
FLUSH                   # → sst-000001.sst (a=1, b=2, c=3)

PUT b 22                # overwrite, lands in next SST
DEL a                   # tombstone, lands in next SST
PUT d 4
FLUSH                   # → sst-000002.sst (a=Tomb, b=22, d=4)

PUT e 5                 # WAL only, never flushed
DEL c                   # WAL only
PUT b 222               # WAL only

Live set after replay = {b=222, d=4, e=5} (a deleted, c deleted). With tombstones = the live set plus tombstones for a and c.

Sizes

DUMP (drop_tombstones=true):  35 bytes
  b=222 :  4(klen) + 1 + 1(type) + 4(vlen) + 3 = 13
  d=4   :  4       + 1 + 1       + 4       + 1 =  11
  e=5   :  4       + 1 + 1       + 4       + 1 =  11
                                                  ---
                                                   35  ✓

DUMP_WITH_TOMBS:  47 bytes
  35 (as above)
  + tombstone a: 4 + 1 + 1 = 6
  + tombstone c: 4 + 1 + 1 = 6
                              ---
                               47  ✓

The arithmetic matches the canonical byte format and the observed file sizes, which means we are not only matching sha256s but matching them on the right content.
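
The same arithmetic as a small Rust check. Only the field widths from the tables above are used (u32 klen, key bytes, one type byte, and for values a u32 vlen plus value bytes); the field order inside an entry is not assumed, and entry_size and dump_size are hypothetical helper names.

```rust
/// Size in bytes of one serialized entry, from the field widths listed above.
/// `value` is None for a tombstone, which carries no vlen and no value bytes.
fn entry_size(key: &str, value: Option<&str>) -> usize {
    let base = 4 + key.len() + 1; // klen + key bytes + type byte
    match value {
        Some(v) => base + 4 + v.len(), // plus vlen + value bytes
        None => base,
    }
}

/// Total size of a dump holding the given entries.
fn dump_size(entries: &[(&str, Option<&str>)]) -> usize {
    entries.iter().map(|(k, v)| entry_size(k, *v)).sum()
}
```

Feeding it the live set {b=222, d=4, e=5} gives 35, and adding tombstones for a and c gives 47, matching the sizes reported by cross_test.sh.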

What this proves

  1. WriteBatch encoder agrees — otherwise WAL records would differ and recovery would diverge.
  2. WAL framing + iterator agree — otherwise WAL replay would produce different memtables in the three languages.
  3. MemTable ordering + tombstone semantics agree — otherwise the merge would produce different streams.
  4. SSTable encoder agrees — otherwise SST files (and therefore the Entries() they yield) would differ.
  5. Recovery procedure agrees — the dump is taken after close and reopen, so any drift in MANIFEST parsing, SST id assignment, or replay order would surface as a sha256 mismatch.
  6. MergingIterator + SerializeStream agree — the same property db-08 verified, now exercised over a memtable+two-SST source set.

Any divergence in any of these six layers, in any one of the three languages, that changes the canonical script's output would break the sha256 match. Matching is therefore very strong evidence that the pipeline behaves identically end-to-end on this workload.

db-09 — Verification

How to reproduce the green status on a clean machine.

Prerequisites

  • macOS or Linux with Apple Clang / clang ≥ 14 / gcc ≥ 11.
  • cmake ≥ 3.20.
  • Rust toolchain ≥ 1.74 (rustup default stable).
  • Go ≥ 1.22.
  • shasum, xxd, awk (default on macOS; coreutils on Linux).

One command

cd db-09-leveldb-complete
scripts/verify.sh        # builds + unit tests in all three langs
scripts/cross_test.sh    # cross-language sha256 match

scripts/verify.sh should end with === OK === and scripts/cross_test.sh with === ALL OK ===; both exit 0.

Per-language drill-down

Rust

cd db-09-leveldb-complete/src/rust
cargo test --quiet
cargo build --release

Expected: 11 passed; 0 failed. The dbctl binary lands in target/release/dbctl.

Go

cd db-09-leveldb-complete/src/go
go test ./...
go build ./cmd/dbctl

Expected: ok github.com/10xdev/dse/db09 <duration>.

C++

cd db-09-leveldb-complete/src/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
ctest --test-dir build --output-on-failure

Expected: 100% tests passed, 0 tests failed out of 1 and the test_db09 binary prints OK.

What "green" means

A green run guarantees:

  • All 33 unit tests pass (11 each in Rust, Go, and C++; ctest reports the C++ suite as a single test binary).
  • The cross-language test produces byte-identical DUMP and DUMP_WITH_TOMBS after a close/reopen cycle.
  • Spot-checked hex bytes for b=222, e=5, and the tombstone for a are present in the stream — guarding against accidental empty-output regressions.

When verification fails

  • Cross-language sha256 mismatch — almost always a divergence in one of: WriteBatch wire format, MANIFEST line format, SST writer ordering, MergingIterator tie-break, or whether tombstones are dropped. The fix is almost never in db-09; it's in the upstream lab whose format drifted.
  • Recovery test fails in one language only — that language's WAL truncation step is wrong. Pattern (all three use it): close WAL → remove file → reopen WAL.
  • C++ ctest reports zero tests — you accidentally did add_subdirectory(../../../db-NN). Compile upstream .cc directly instead.

db-09 — Broader Ideas

Where to take this engine next, and where it already touches the rest of distributed-systems engineering.

Immediate next labs

  • db-10 — B-tree fundamentals. The "other half" of storage. LSMs optimize for write-heavy workloads with append-only files and amortized rewrites; B-trees optimize for in-place point updates and range scans. Both shapes appear across production databases: Postgres keeps rows in heap files with B-tree indexes and a WAL (oversized values go to TOAST, and dead row versions are reaped by VACUUM); MySQL/InnoDB stores rows in a clustered B+-tree primary index backed by an UNDO log.
  • db-21 — Storage-engine advanced. Compaction, leveled compaction policy, block cache wiring, bloom-filter probing, snapshots, file garbage collection. Everything that "real" LevelDB/RocksDB has that we postponed in db-09.

How this lab's pieces show up in distributed systems

  • MANIFEST as a tiny version-edit log is a microcosm of how distributed systems use a log to make state changes atomic. A Raft log is the same pattern at machine granularity: apply changes to a state machine only after they're durably appended to an append-only log.
  • rename for atomic publish is the local-filesystem analogue of two-phase commit. The OS gives us a strong primitive (rename is atomic under crash) and we lean on it. Distributed systems have to build equivalent primitives (Paxos / Raft / 2PC) because no underlying layer provides them for free.
  • Newest-source-wins under a total order on writes is exactly how a CRDT LWW-register, a multi-version concurrency control snapshot read, a Kafka log-compaction "last value wins" topic, and a Bigtable per-cell-with-timestamp work. The variable that changes between systems is what defines "newer" (file id here; sequence number in LevelDB proper; timestamp in Cassandra; vector clock in Dynamo).

Performance experiments worth running later

These are not required for the lab to be green; they are good Saturday afternoon explorations:

  • Plot read-amplification growth as L0 grows: write N batches with a FLUSH after each one (so L0 holds N SSTables and nothing compacts them), then measure point-lookup latency vs N.
  • Replace text MANIFEST with LevelDB-style version-edit log; measure flush latency improvement at large live-set counts.
  • Add a block cache between SstReader::Get and the file bytes; measure hit rate on a Zipfian workload.
  • Wire bloom filters (db-04) per SSTable; measure how many SSTs you can skip for a typical miss-heavy workload.

What "production-ready" would require beyond this lab

  • Concurrent writers (a real Mutex on the write path, multiple readers via versioned snapshots).
  • Group commit (batch many WAL appends behind one fsync).
  • Direct I/O / pwrite-based SST writer to avoid double-buffering.
  • Checksums on every block read, not only at SST footer level.
  • A scheduler for background flush + compaction with admission control.
  • fsync(dir) after every file create / rename to survive metadata-loss scenarios on certain filesystems.

None of these change the shape of the engine — they make the same shape faster and tougher.

db-09 step 01 — The write path

Goal

Implement Db::open(dir), put(k,v), delete(k), and Write(WriteBatch) such that every successful return has been durably persisted to the WAL.

Tasks

  1. Pick the on-disk layout (MANIFEST, wal.log, sst-NNNNNN.sst).
  2. Define the WriteBatch wire format. Use a single encoder/decoder so the in-memory batch representation and the WAL record payload are identical bytes.
  3. On open(dir):
    • mkdir -p the directory.
    • Read MANIFEST (if it exists) one line at a time; collect SST ids newest-first.
    • Open each SSTable in order; track max_id.
    • Replay wal.log with WalIter: decode each record as a batch and apply to a fresh memtable.
    • Open the WAL for writes; set next_id = max_id + 1.
  4. Implement Write(batch):
    • Treat the empty batch as a no-op: return success without writing an empty WAL record.
    • bytes = batch.encode(); wal.append(bytes); wal.sync();
    • Apply the batch to the memtable.
  5. put and delete are thin wrappers that build a one-op batch.
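
Task 4 can be sketched as follows. Everything here is a simplified stand-in: the real lab goes through db-03's Wal type with record-length + CRC32 framing, whereas this sketch appends a plain u32 length prefix (an assumption made only to keep the example self-contained), and Db, open_wal_only, and write are illustrative names.

```rust
use std::collections::BTreeMap;
use std::fs::{File, OpenOptions};
use std::io::Write;
use std::path::Path;

/// (key, Some(value)) is a Put; (key, None) is a Delete.
type Ops = [(Vec<u8>, Option<Vec<u8>>)];

/// WriteBatch wire format: u32 LE count, then type byte, klen, key, [vlen, value].
fn encode(ops: &Ops) -> Vec<u8> {
    let mut out = (ops.len() as u32).to_le_bytes().to_vec();
    for (key, value) in ops {
        out.push(if value.is_some() { 0 } else { 1 });
        out.extend_from_slice(&(key.len() as u32).to_le_bytes());
        out.extend_from_slice(key);
        if let Some(v) = value {
            out.extend_from_slice(&(v.len() as u32).to_le_bytes());
            out.extend_from_slice(v);
        }
    }
    out
}

struct Db {
    wal: File,
    memtable: BTreeMap<Vec<u8>, Option<Vec<u8>>>, // None marks a tombstone
}

impl Db {
    /// Open only the WAL for appending; MANIFEST and SST handling are omitted.
    fn open_wal_only(wal_path: &Path) -> std::io::Result<Db> {
        let wal = OpenOptions::new().create(true).append(true).open(wal_path)?;
        Ok(Db { wal, memtable: BTreeMap::new() })
    }

    fn write(&mut self, ops: &Ops) -> std::io::Result<()> {
        if ops.is_empty() {
            return Ok(()); // no-op: never emit an empty WAL record
        }
        let payload = encode(ops);
        // Hypothetical framing: u32 LE length prefix (db-03 adds a CRC32 too).
        self.wal.write_all(&(payload.len() as u32).to_le_bytes())?;
        self.wal.write_all(&payload)?;
        self.wal.sync_data()?; // durability first
        for (k, v) in ops {
            self.memtable.insert(k.clone(), v.clone()); // then visibility
        }
        Ok(())
    }
}
```

If sync_data fails, the function returns before touching the memtable, so a reader never observes a write that the WAL might not hold.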

Acceptance

Inline unit tests:

  • batch_roundtrip — encode → decode round-trip preserves three representative ops (Put, Delete, Put-with-empty-key).
  • batch_rejects_trailing — decoding rejects a one-byte-suffix-corrupted payload.
  • put_get_memtable — put("a","1") followed by get("a") returns Some("1"); get("missing") returns None.
  • delete_shadows — put then delete makes get return None.

All four green in Rust, Go, and C++.

Discussion prompts

  • Why must wal.sync() happen before applying to the memtable, not after?
  • What invariant would break if we let Write proceed for an empty batch?
  • How would a group-commit optimization preserve the same durability guarantee while batching multiple Write calls behind a single fsync?

db-09 step 02 — Flush and recovery

Goal

Implement Flush() and recovery such that crashes between any two file operations never produce an inconsistent live set.

Tasks

  1. Implement Flush() as the strict sequence:

    1. If memtable is empty, return.
    2. Allocate id = next_id; next_id += 1.
    3. Build an SstWriter from memtable.sorted(). For each entry, map EntryType::Value → Value (with bytes) and EntryType::Tombstone → Tombstone (empty value).
    4. Write sst-<id>.sst.tmp durably (open + write + fsync).
    5. rename it to sst-<id>.sst.
    6. Prepend (id, SstReader) to the in-memory ssts list (newest first).
    7. Rewrite MANIFEST atomically: write MANIFEST.tmp durably (one L0 <id> line per live SST, newest first), then rename to MANIFEST.
    8. Close the WAL, remove wal.log, reopen the WAL.
    9. Replace memtable with an empty one.
  2. Verify the recovery sequence implemented in step 01 still satisfies the crash matrix:

    Crash between …     Effect on next open
    step 4 and 5        leftover *.tmp file, ignored on next open
    step 5 and 7        leftover unlisted SST file, ignored on next open
    step 7 and 8        replayed WAL re-applies writes that are also in the latest SST; idempotent because MemTable::put overwrites
    step 8 and 9        harmless: step 9 only resets in-memory state, which a crash discards anyway; recovery sees the new SST and an empty WAL

Acceptance

Inline unit tests:

  • flush_creates_sst — after Flush(), memtable empty and LiveSstIds().len() == 1; reads still work.
  • flush_then_recover — Flush(), drop Db, reopen, reads still return the flushed values.
  • wal_replay — without flushing, drop Db, reopen; memtable has the pre-crash writes.
  • newest_sst_wins — two flushes with overlapping keys; the value from the newer flush is returned.
  • recovery_after_flush_plus_wal — mix: flush, then write more (tombstones + puts) without flushing, drop, reopen; reads reflect both the flushed and the WAL-only writes correctly.

All five green in Rust, Go, and C++.

Discussion prompts

  • Why prepend instead of append to the ssts list?
  • Why is it safe to truncate the WAL even when the new MANIFEST may not yet be fsync'd to its parent directory?
  • What would change if step 7 used an edit log (append a "+id" record) instead of rewriting the whole file?

db-09 step 03 — CLI and cross-language byte-identity

Goal

Build a dbctl --dir DIR CLI in all three languages that reads commands from stdin, then assert via sha256 that all three produce byte-identical output for the same canonical script — including after a crash/recover cycle.

CLI contract

Each line of stdin is one of:

# comment (ignored)
PUT  <key>  <value>      # whitespace-delimited (no spaces inside)
DEL  <key>
FLUSH
DUMP                     # write serialize_view(drop_tombstones=true) to stdout
DUMP_WITH_TOMBS          # write serialize_view(drop_tombstones=false) to stdout

Blank lines and lines starting with # are ignored.

DUMP and DUMP_WITH_TOMBS write raw bytes (no trailing newline) so that sha256 over stdout is a pure function of the database state.
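
One way to parse that contract, as a sketch. Command and parse_line are hypothetical names. The sketch assumes that only whole-line comments are script syntax, treating the trailing "# ..." notes in the listing above as documentation, and that a wrong token count is an error; neither is confirmed by the text.

```rust
#[derive(Debug, PartialEq)]
enum Command {
    Put(String, String),
    Del(String),
    Flush,
    Dump,
    DumpWithTombs,
}

/// Parse one stdin line. Ok(None) means "ignore" (blank line or comment).
fn parse_line(line: &str) -> Result<Option<Command>, String> {
    let line = line.trim();
    if line.is_empty() || line.starts_with('#') {
        return Ok(None);
    }
    // Keys and values contain no spaces, so whitespace splitting is enough.
    let tokens: Vec<&str> = line.split_whitespace().collect();
    let cmd = match tokens.as_slice() {
        ["PUT", k, v] => Command::Put(k.to_string(), v.to_string()),
        ["DEL", k] => Command::Del(k.to_string()),
        ["FLUSH"] => Command::Flush,
        ["DUMP"] => Command::Dump,
        ["DUMP_WITH_TOMBS"] => Command::DumpWithTombs,
        _ => return Err(format!("unrecognized command: {:?}", line)),
    };
    Ok(Some(cmd))
}
```

Failing loudly on malformed lines is a deliberate choice for a cross-language harness: silently skipping a bad PUT in one language would show up only later as a hash mismatch.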

Tasks

  1. Build dbctl in Rust (src/rust/src/bin/dbctl.rs), Go (src/go/cmd/dbctl/main.go), and C++ (src/cpp/src/dbctl.cc).
  2. Write scripts/cross_test.sh that:
    1. Builds all three binaries.
    2. Creates one canonical command script that exercises multi-flush, overwrites that land in newer SSTables, tombstones, and a non-empty WAL tail.
    3. For each language: pipes the script into dbctl --dir db-LANG (which fully writes and closes), then reopens the directory and pipes DUMP (and separately DUMP_WITH_TOMBS) into a file.
    4. Computes sha256 over each dump file; asserts all three match.
    5. Spot-checks the Rust DUMP stream hex for the expected post-recovery key-value bytes to guard against silent-empty regressions.
  3. Write scripts/verify.sh that runs unit tests in all three languages.

Acceptance

$ scripts/verify.sh
=== rust === ... ok
=== go   === ... ok
=== cpp  === ... ok
=== OK ===

$ scripts/cross_test.sh
...
  match(DUMP):       7d1568c7...
  match(DUMP_TOMBS): 27e3d256...
  spot-checks ok
=== ALL OK ===

A byte-identical DUMP after reopen is a near-proof of correctness for the entire encode → flush → MANIFEST → recover → merge → serialize pipeline across three independent implementations.

Discussion prompts

  • Why force a close+reopen between the writes and the DUMP, instead of dumping from the same process?
  • Why is DUMP (without tombstones) not a sound proof on its own? What does DUMP_WITH_TOMBS add?
  • If the three sha256s ever diverge, which lab's format is the most probable culprit, and why?

db-10 — B-Tree Fundamentals

The first lab of the B-tree track. Up to db-09 every persistent structure we built was an LSM (log + sorted runs + merge). Postgres, MySQL/InnoDB, SQLite, Oracle, SQL Server, and most embedded key-value engines you have never heard of are B-trees instead. This lab builds the in-memory kernel; db-11 wraps it in a pager so it can live on disk.

What is it?

A self-balancing search tree where every node holds up to 2T - 1 keys (and, if internal, up to 2T children). We pick the smallest non-trivial degree T = 2, giving:

  • 1 ≤ keys ≤ 3 per non-root node
  • 2 ≤ children ≤ 4 per non-root internal node
  • root may hold 1..3 keys (or 0 if the tree is empty)

The algorithms are the textbook CLRS B-tree: insert splits a child proactively while descending if it is full; delete rebalances a child proactively while descending if it would underflow. With this discipline every operation is exactly one root-to-leaf traversal — no second pass, no recursion back up to fix invariants.

Keys and values are arbitrary byte slices; comparison is lexicographic. Each node carries the value of every key it holds (this is a B-tree, not a B+-tree — values do not live exclusively in the leaves). db-11 will make the leaf-only choice when we introduce the pager and need to keep internal nodes small.

Why does it matter?

  • Predictable depth. Height is O(log_T n), so the number of nodes visited per lookup is small and tightly bounded no matter the insertion order. LSMs get cheap sequential writes at the price of read amplification that grows with the number of levels (or L0 files) a lookup must probe; B-trees pay page rewrites for a tight read bound.
  • In-place update. A B-tree key update mutates exactly one node. LSMs append a new record and reclaim the old one during compaction. Which is better depends on workload — db-22 will measure it.
  • The canonical study substrate. Every working storage engineer has implemented a B-tree at least once. Splits and merges are the microcosm of every concurrent, copy-on-write, or page-versioned variant that exists in production code.

How does it work?

Node layout

        ┌─────────────────────────── Node ────────────────────────────┐
        │  is_leaf : bool                                             │
        │  keys    : Vec<(key, value)>      // 1..3 entries           │
        │  children: Vec<Box<Node>>         // 0 if leaf, else nkeys+1│
        └─────────────────────────────────────────────────────────────┘

Internal node with two keys (k0 < k1):

        ┌──────────┬──────────┐
        │   k0,v0  │   k1,v1  │
        └─┬──────┬─┴────────┬─┘
          │      │          │
          ▼      ▼          ▼
        c0: keys < k0   c1: k0 < keys < k1   c2: k1 < keys

Insert (proactive split)

Descend from the root. Before stepping into any full child (nkeys == 3), split that child in place: promote its middle key to the parent, drop the right sibling into the parent's child list at position i+1, and let the new (now non-full) child take the descent. If the root itself is full, grow upwards: create a new parent with the old root as its only child, then split. This is the only place tree height increases.

Before split (child too full):     After split (middle promoted):

   [   K   ]                          [ K , k1 ]
        │                              │      │
        ▼                              ▼      ▼
  [k0, k1, k2]                       [k0]    [k2]
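
A compact Rust sketch of the proactive-split insert with T = 2. It follows the description above but simplifies the node: leaf status is derived from an empty children vector instead of a stored is_leaf flag, and children are held as Vec<Node> instead of Vec<Box<Node>>. All names are illustrative.

```rust
use std::cmp::Ordering;

const T: usize = 2;

#[derive(Debug, Default)]
struct Node {
    keys: Vec<(Vec<u8>, Vec<u8>)>, // sorted (key, value) pairs
    children: Vec<Node>,           // empty for a leaf, else keys.len() + 1
}

impl Node {
    fn is_leaf(&self) -> bool { self.children.is_empty() }
    fn is_full(&self) -> bool { self.keys.len() == 2 * T - 1 }

    /// Split the full child at index `i`: promote its middle key into `self`
    /// and place the new right sibling at child position `i + 1`.
    fn split_child(&mut self, i: usize) {
        let child = &mut self.children[i];
        let right_keys = child.keys.split_off(T); // keys after the middle
        let middle = child.keys.pop().expect("full child has a middle key");
        let right_children =
            if child.is_leaf() { Vec::new() } else { child.children.split_off(T) };
        self.keys.insert(i, middle);
        self.children.insert(i + 1, Node { keys: right_keys, children: right_children });
    }

    /// Insert into a node that is known not to be full.
    fn insert_nonfull(&mut self, key: Vec<u8>, value: Vec<u8>) {
        match self.keys.binary_search_by(|(k, _)| k.cmp(&key)) {
            Ok(pos) => self.keys[pos].1 = value, // existing key: overwrite
            Err(mut pos) => {
                if self.is_leaf() {
                    self.keys.insert(pos, (key, value));
                    return;
                }
                if self.children[pos].is_full() {
                    self.split_child(pos);
                    // The promoted key now sits at `pos`; pick a side.
                    match key.cmp(&self.keys[pos].0) {
                        Ordering::Equal => { self.keys[pos].1 = value; return; }
                        Ordering::Greater => pos += 1,
                        Ordering::Less => {}
                    }
                }
                self.children[pos].insert_nonfull(key, value);
            }
        }
    }

    /// Append keys in sorted order (used only to check the result).
    fn in_order(&self, out: &mut Vec<Vec<u8>>) {
        for (i, (k, _)) in self.keys.iter().enumerate() {
            if !self.is_leaf() { self.children[i].in_order(out); }
            out.push(k.clone());
        }
        if let Some(last) = self.children.last() { last.in_order(out); }
    }
}

#[derive(Debug, Default)]
struct BTree { root: Node }

impl BTree {
    fn insert(&mut self, key: &[u8], value: &[u8]) {
        if self.root.is_full() {
            // Grow upwards: the old root becomes the only child of a new root.
            let old_root = std::mem::take(&mut self.root);
            self.root.children.push(old_root);
            self.root.split_child(0);
        }
        self.root.insert_nonfull(key.to_vec(), value.to_vec());
    }
}
```

Inserting a, b, c, d in that order fills the root leaf with three keys, then the fourth insert splits it: b is promoted, and d lands in the right child next to c.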

Delete (proactive rebalance)

Descend from the root looking for the key. Before stepping into any child that has only T - 1 = 1 key, ensure it has at least T = 2 keys by one of:

  1. Borrow from left sibling — rotate left sibling's last key up into the parent, parent's separator down into the child's front.
  2. Borrow from right sibling — symmetric.
  3. Merge with a sibling — pull the parent's separator down, concatenate child + separator + sibling into a single node with 2T - 1 = 3 keys.

If the root becomes an empty internal node (only one child, no keys) after the operation, collapse it: the root's only child becomes the new root. This is the only place tree height decreases.

Deletes that hit an internal key are handled by replacing the key with its in-order successor (or predecessor) and recursing the delete into that subtree, where the recursion terminates at a leaf.

Canonical serialization

A preorder traversal of the tree emitting, per node:

u8     is_leaf            (1 if leaf, 0 if internal)
u32 LE nkeys
nkeys * { u32 LE klen | klen bytes key | u32 LE vlen | vlen bytes val }
if !is_leaf:
   (nkeys + 1) * recurse(child)

The empty tree therefore serializes as five bytes: 01 00 00 00 00 (one leaf node with zero keys).
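
The format translates almost line for line into a recursive function. This sketch uses a simplified Node (children as Vec<Node>) and a hypothetical free function serialize; the byte layout is the one listed above.

```rust
struct Node {
    is_leaf: bool,
    keys: Vec<(Vec<u8>, Vec<u8>)>, // (key, value) pairs in sorted order
    children: Vec<Node>,           // keys.len() + 1 entries when internal
}

/// Preorder serialization: this node's header and entries, then each child.
fn serialize(node: &Node, out: &mut Vec<u8>) {
    out.push(if node.is_leaf { 1 } else { 0 });
    out.extend_from_slice(&(node.keys.len() as u32).to_le_bytes());
    for (k, v) in &node.keys {
        out.extend_from_slice(&(k.len() as u32).to_le_bytes());
        out.extend_from_slice(k);
        out.extend_from_slice(&(v.len() as u32).to_le_bytes());
        out.extend_from_slice(v);
    }
    if !node.is_leaf {
        for child in &node.children {
            serialize(child, out);
        }
    }
}
```

Because every node contributes its own 5-byte header, a single leaf holding three keys and a root with two one-key leaves serialize to different lengths even though they store the same pairs.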

This format captures structure, not just contents. Two trees with the same {(key, value)} set but different splits / shapes produce different byte sequences — so scripts/cross_test.sh would catch a language whose insertion order or split rule diverged, even if the externally-visible scan output still agreed.

Deterministic workload

run_workload(scenario, seed, ops) drives a fresh tree using SplitMix64(seed) to generate keys (8-byte big-endian indices modulo a 200-entry key space) and values (4-byte big-endian). The three scenarios:

scenario    per-iteration behavior
inserts     always insert(key, val)
deletes     insert during the first half, delete(key) during the second
mixed       bits 62..63 of r1 decide: insert (0,1), delete (2), no-op (3)

Two PRNG outputs are consumed per iteration regardless of which branch is taken, so the key sequence is invariant under the scenario choice and only the operation kind differs. This makes the three scenarios easy to reason about: they all visit the same keys in the same order.
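
A sketch of such a generator. The SplitMix64 step is the standard published algorithm; the rest is assumption, since the text does not say which PRNG output feeds which field. Here the first output r0 supplies the key index, the low 32 bits of the second output r1 supply the value, and plan, Kind, and Scenario are hypothetical names.

```rust
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        SplitMix64 { state: seed }
    }

    /// Standard SplitMix64 step: advance by the golden-ratio increment, then mix.
    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind { Insert, Delete, Noop }

#[derive(Debug, Clone, Copy, PartialEq)]
enum Scenario { Inserts, Deletes, Mixed }

/// Produce the operation list for a workload without touching a tree.
/// Two outputs are drawn per iteration in every scenario, so the key
/// sequence depends only on the seed.
fn plan(scenario: Scenario, seed: u64, ops: usize) -> Vec<(Kind, [u8; 8], [u8; 4])> {
    let mut rng = SplitMix64::new(seed);
    (0..ops)
        .map(|i| {
            let r0 = rng.next_u64();
            let r1 = rng.next_u64();
            let key = (r0 % 200).to_be_bytes();   // assumption: r0 picks the key
            let val = (r1 as u32).to_be_bytes();  // assumption: low 32 bits of r1
            let kind = match scenario {
                Scenario::Inserts => Kind::Insert,
                Scenario::Deletes => {
                    if i < ops / 2 { Kind::Insert } else { Kind::Delete }
                }
                Scenario::Mixed => match r1 >> 62 {
                    0 | 1 => Kind::Insert,
                    2 => Kind::Delete,
                    _ => Kind::Noop,
                },
            };
            (kind, key, val)
        })
        .collect()
}
```

Separating the plan from its application to a tree makes the invariance claim directly testable: two scenarios with the same seed must yield identical key columns.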

The btreectl CLI

btreectl --seed N --ops M --scenario {inserts|deletes|mixed}

Runs the chosen workload and writes the serialized tree to stdout (raw bytes, no trailing newline).

Cross-language invariant

scripts/cross_test.sh runs the same (seed, ops, scenario) triple through Rust, Go, and C++ btreectl binaries and asserts that all three produce byte-identical output via sha256 for two scenarios:

scenario     seed    ops    sha256                                                              size
A inserts    42      500    4b587ccce2627561c03d5db0c2c172642c9f3ed188c97fc53a215a3d0f316088    varies
B mixed      7       500    9edbeec6436ee549c8a52b97f286831ed340c4bb588c6371542cdf0421e37718    2515 B

A matching hash proves that all three implementations agree on: the PRNG, the lexicographic key compare, the proactive-split insertion, the proactive-rebalance deletion, and the precise tree shape after the workload. Any drift in any of these surfaces as a sha256 mismatch.

What's intentionally out of scope

  • Persistence. db-11 introduces the pager and turns nodes into fixed-size disk pages.
  • B+-tree leaves-only-values layout. Also db-11; it's the natural change once internal nodes need to fit one to a page.
  • Concurrent / lock-coupling B-trees. db-13 (MVCC) and db-21 (storage-engine advanced) explore copy-on-write and latch protocols.
  • Variable-length keys with prefix compression. SQLite and RocksDB both do this; we will revisit in db-15.

db-10 — References

Primary source

  • Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Introduction to Algorithms, 3rd or 4th ed., MIT Press. Chapter 18 (B-Trees) is the textbook treatment whose proactive-split / proactive-rebalance discipline we follow line-for-line. This is the single most important reference for the lab.

Original papers

  • R. Bayer & E. McCreight, Organization and Maintenance of Large Ordered Indices, Acta Informatica 1, 1972. The paper that introduced the B-tree.
  • D. Comer, The Ubiquitous B-Tree, ACM Computing Surveys 11(2), 1979. The classic survey; explains B+, B*, and variants. A useful read before starting db-11 where the leaves-only layout enters.

Production engines that use B-trees or B+-trees

Cross-lab dependencies

  • None upstream. db-10 is the start of the B-tree track and imports no earlier labs.
  • Downstream consumers: db-11 (pager) wraps each node in a fixed-size disk page; db-12 (SQL frontend) treats the tree as the table storage layer; db-13 (MVCC) snapshots node references rather than page bytes; db-14 (indexes) builds secondary B-trees over the primary tree's keys.

db-10 — Analysis

What had to be decided before any code was written, and why each decision shapes the next 5 labs.

Required invariants

  1. Search-tree order. For every internal node with keys k_0 < k_1 < … < k_{n-1} and children c_0, c_1, …, c_n, every key in c_i is < k_i and every key in c_{i+1} is > k_i.
  2. Bounded fanout. Non-root nodes hold between T - 1 and 2T - 1 keys (1..3 with T = 2). The root is exempt from the lower bound; with T = 2 that means it may hold zero keys, which happens only when the tree is empty or the root is about to be collapsed.
  3. Uniform depth. All leaves sit at the same depth from the root. This is what makes the worst-case lookup guaranteed to be O(log_T n), not merely expected.
  4. Proactive split / rebalance. The descent on insert never needs to back up to fix an overflow; the descent on delete never needs to back up to fix an underflow. Each mutating operation touches each level on the path exactly once.
  5. Canonical serialization. Two B-trees with the same shape and the same entries must serialize to the same bytes regardless of the insertion order that produced them; two B-trees with different shapes must serialize to different bytes even if they hold the same key-value set.
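Invariants 1 to 3 are mechanical enough to check in a single recursive pass. The following is a minimal sketch, not the lab's code: `check` and `leaf` are hypothetical names, and the `Node` layout is assumed to match the one described in the Execution section.

```rust
const T: usize = 2;
const MIN_KEYS: usize = T - 1;
const MAX_KEYS: usize = 2 * T - 1;

struct Node {
    is_leaf: bool,
    keys: Vec<(Vec<u8>, Vec<u8>)>,
    children: Vec<Box<Node>>,
}

/// Hypothetical helper: a leaf holding `keys` with empty values.
fn leaf(keys: &[&str]) -> Box<Node> {
    Box::new(Node {
        is_leaf: true,
        keys: keys.iter().map(|k| (k.as_bytes().to_vec(), Vec::new())).collect(),
        children: Vec::new(),
    })
}

/// Checks invariants 1-3 on the subtree at `node`. `lo` and `hi` are the
/// exclusive key bounds inherited from ancestors. Returns the distance from
/// `node` down to its leaves, or a description of the first violation.
fn check(node: &Node, is_root: bool, lo: Option<&[u8]>, hi: Option<&[u8]>) -> Result<usize, String> {
    let n = node.keys.len();
    if n > MAX_KEYS || (!is_root && n < MIN_KEYS) {
        return Err(format!("bounded fanout violated: {} keys", n));
    }
    for i in 0..n {
        let k = node.keys[i].0.as_slice();
        if i > 0 && node.keys[i - 1].0.as_slice() >= k {
            return Err("keys not strictly increasing".to_string());
        }
        if lo.map_or(false, |l| k <= l) || hi.map_or(false, |h| k >= h) {
            return Err("key outside the range set by its ancestors".to_string());
        }
    }
    if node.is_leaf {
        return if node.children.is_empty() { Ok(0) } else { Err("leaf has children".to_string()) };
    }
    if node.children.len() != n + 1 {
        return Err("internal node needs keys + 1 children".to_string());
    }
    let mut depth: Option<usize> = None;
    for (i, child) in node.children.iter().enumerate() {
        // Each child inherits the separators on either side of it as bounds.
        let child_lo = if i == 0 { lo } else { Some(node.keys[i - 1].0.as_slice()) };
        let child_hi = if i == n { hi } else { Some(node.keys[i].0.as_slice()) };
        let d = check(child, false, child_lo, child_hi)?;
        if *depth.get_or_insert(d) != d {
            return Err("leaves at different depths".to_string());
        }
    }
    Ok(depth.unwrap_or(0) + 1)
}

fn main() {
    let root = |left: &str, right: &str| Node {
        is_leaf: false,
        keys: vec![(b"b".to_vec(), Vec::new())],
        children: vec![leaf(&[left]), leaf(&[right])],
    };
    println!("{:?}", check(&root("a", "c"), true, None, None));
    println!("{:?}", check(&root("c", "a"), true, None, None));
}
```

Running a checker like this after every mutation in the unit tests is one way to catch a broken split or merge before it reaches the cross-language hash.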

Design decisions

Why T = 2 (smallest non-trivial degree)

Larger T means flatter trees and more keys per page — what real B-trees use to amortize disk I/O. But the algorithms are identical at every T ≥ 2, and T = 2 makes splits and merges frequent, which makes them easy to spot, easy to unit-test, and easy to render in a hex dump. db-11 will bump T to something realistic (e.g. matching a 4 KiB page) once nodes are pages.

Why B-tree, not B+-tree

A B+-tree puts values only in the leaves and threads the leaves into a doubly-linked list for range scans. That's the right call once nodes are disk pages — internal nodes shrink because they don't carry values, so fanout (and therefore depth) wins. In-memory, with T = 2, the values-in-every-node B-tree is simpler and the savings would be invisible. db-11 swaps to a B+-tree when the disk-page trade-off applies.

Why the wire format encodes structure, not just contents

Two trees with the same {(k, v)} set can have different shapes if they were built by different insertion orders. A serializer that only emits the in-order key list (essentially scan()) would let a serious bug — say, swapping the left and right halves of a split — hide forever, because the bug would manifest only as different tree shapes, never different scan results.

By emitting the full preorder shape, byte-equality across languages is byte-equality of the trees' physical state. db-11 will reuse this property: the page-byte serialization of a B+-tree should be exactly reproducible across implementations.
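To make the argument concrete, here is a small sketch that builds two hand-made trees holding the same keys in different shapes and compares a contents-only view with the structural stream. The helper names (`leaf`, `serialize`, `scan_keys`) are illustrative; the byte layout follows the wire format this lab describes.

```rust
struct Node {
    is_leaf: bool,
    keys: Vec<(Vec<u8>, Vec<u8>)>,
    children: Vec<Box<Node>>,
}

/// Hypothetical helper: a leaf holding `keys` with empty values.
fn leaf(keys: &[&str]) -> Box<Node> {
    Box::new(Node {
        is_leaf: true,
        keys: keys.iter().map(|k| (k.as_bytes().to_vec(), Vec::new())).collect(),
        children: Vec::new(),
    })
}

/// Preorder stream: u8 is_leaf, u32 LE nkeys, then per entry
/// u32 LE klen, key, u32 LE vlen, val; children follow for internal nodes.
fn write_node(node: &Node, out: &mut Vec<u8>) {
    out.push(node.is_leaf as u8);
    out.extend_from_slice(&(node.keys.len() as u32).to_le_bytes());
    for (k, v) in &node.keys {
        out.extend_from_slice(&(k.len() as u32).to_le_bytes());
        out.extend_from_slice(k);
        out.extend_from_slice(&(v.len() as u32).to_le_bytes());
        out.extend_from_slice(v);
    }
    for child in &node.children {
        write_node(child, out);
    }
}

fn serialize(node: &Node) -> Vec<u8> {
    let mut out = Vec::new();
    write_node(node, &mut out);
    out
}

/// In-order key list: all that a contents-only serializer would capture.
fn scan_keys(node: &Node, out: &mut Vec<Vec<u8>>) {
    for i in 0..node.keys.len() {
        if !node.is_leaf {
            scan_keys(&node.children[i], out);
        }
        out.push(node.keys[i].0.clone());
    }
    if !node.is_leaf {
        scan_keys(&node.children[node.keys.len()], out);
    }
}

/// Two valid T = 2 trees over the keys {a, b, c}: one flat, one split.
fn two_shapes() -> (Box<Node>, Box<Node>) {
    let flat = leaf(&["a", "b", "c"]);
    let split = Box::new(Node {
        is_leaf: false,
        keys: vec![(b"b".to_vec(), Vec::new())],
        children: vec![leaf(&["a"]), leaf(&["c"])],
    });
    (flat, split)
}

fn main() {
    let (flat, split) = two_shapes();
    let (mut k1, mut k2) = (Vec::new(), Vec::new());
    scan_keys(&flat, &mut k1);
    scan_keys(&split, &mut k2);
    println!("same scan: {}", k1 == k2);
    println!("same bytes: {}", serialize(&flat) == serialize(&split));
}
```

The scans agree while the byte streams do not, which is exactly the class of difference a scan-only format cannot see.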

Why the workload generator reads two PRNG outputs unconditionally

Each run_workload iteration consumes exactly r1 and r2, regardless of whether the chosen scenario inserts, deletes, or skips on that iteration. If a scenario consumed a variable number of PRNG draws, the sequence of keys would diverge across scenarios for the same seed, making the cross-scenario hashes incomparable and the bug hunt much harder.

The cost: a small amount of wasted entropy on no-op iterations. The gain: scenarios inserts, deletes, and mixed all visit the same key-space in the same order for the same seed, so any divergence is the operations' fault, not the keys'.
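The effect of a skipped draw is easy to demonstrate without a tree at all. In the sketch below, `key_trace` is a hypothetical helper that records only the keys a workload would visit; its `draw_r2` predicate lets us simulate a buggy implementation that omits the second draw on some iterations. The SplitMix64 constants are the ones this lab specifies.

```rust
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        SplitMix64 { state: seed }
    }

    /// Wrapping add on the state, then two xor-shift-multiply rounds.
    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E7B5);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }
}

const KEY_SPACE: u64 = 200;

/// Keys visited by `ops` iterations. `draw_r2(r1)` decides whether the
/// second draw happens; a correct implementation always returns true.
fn key_trace(seed: u64, ops: usize, draw_r2: impl Fn(u64) -> bool) -> Vec<u64> {
    let mut rng = SplitMix64::new(seed);
    let mut keys = Vec::with_capacity(ops);
    for _ in 0..ops {
        let r1 = rng.next_u64();
        if draw_r2(r1) {
            let _r2 = rng.next_u64(); // value unused here; only the state advance matters
        }
        keys.push(r1 % KEY_SPACE);
    }
    keys
}

fn main() {
    let correct = key_trace(42, 500, |_| true);
    // Simulated bug: skip the second draw when the "mixed" op selector says skip.
    let buggy = key_trace(42, 500, |r1| (r1 >> 62) != 3);
    println!("first keys (correct): {:?}", &correct[..5]);
    println!("traces identical: {}", correct == buggy);
}
```

Because the key depends only on r1 and the generator state, any implementation that always draws twice produces the `correct` trace for every scenario.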

Why scenarios live in the library, not in the CLI

run_workload(...) is a library function that returns a BTree. The btreectl binary is a one-liner around it. This means the inline unit tests can call run_workload("mixed", 42, 500) directly and assert determinism with no shell-out, no file I/O, and no path-dependent flakiness. The same property lets cross_test.sh trivially compare three independent CLI binaries.

Why three languages

  • Forces the API to be small and explicit. The Rust Box<Node> recursion translates to Go's struct pointer recursion and C++'s std::unique_ptr<Node> recursion; if the algorithm needs language-specific cleverness, you've over-fit to one runtime.
  • Pins integer arithmetic. SplitMix64 uses wrapping unsigned multiplication; expressing it identically in three languages is a forcing function for the cross-language hash to match.
  • Provides a deterministic conformance suite for the whole B-tree track. When db-11's pager produces a tree whose in-memory shape disagrees with the pure in-memory baseline, db-10's serializer is the comparison witness.

Tradeoffs worth flagging

  • The serializer recurses on the call stack. For pathologically deep trees this could overflow. With T = 2 and 64-bit keys drawn from a 200-key space, the worst-case height is roughly log_2 200 ≈ 8 and the stack is never the bottleneck. db-11's paged variant will be even shallower and is fine to keep recursive.
  • Keys and values are stored as owned Vec<u8> / []byte / std::vector<uint8_t>. This is the simplest correct choice and it dominates allocation cost. db-22 (perf) will revisit whether to intern, slice, or arena-allocate.
  • delete returns bool (was-present) rather than the removed value. Sufficient for testing; some real engines need the payload (e.g., to free its backing buffer). Easy to extend.
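The recursion-depth estimate can be tightened with the standard bound from CLRS (Theorem 18.1): a B-tree of minimum degree T holding n ≥ 1 keys has height h ≤ log_T((n + 1) / 2), because a tree of height h contains at least 2·T^h − 1 keys. A short sketch of that arithmetic, with `max_height` as an illustrative name and the root counted at depth 0:

```rust
/// Largest height (root at depth 0) that a B-tree of minimum degree `t`
/// can reach while holding `n` keys. A tree of height h needs at least
/// 2 * t^h - 1 keys, so we look for the largest h with t^h <= ceil(n / 2).
fn max_height(n: u64, t: u64) -> u32 {
    assert!(n >= 1 && t >= 2);
    let limit = n / 2 + n % 2; // ceil(n / 2), written to avoid overflow
    let mut h = 0u32;
    let mut pow = 1u64; // t^h
    while let Some(next) = pow.checked_mul(t) {
        if next > limit {
            break;
        }
        pow = next;
        h += 1;
    }
    h
}

fn main() {
    // KEY_SPACE = 200 caps the number of distinct keys in this lab.
    println!("T = 2,  n = 200: height <= {}", max_height(200, 2));
    println!("T = 64, n = 200: height <= {}", max_height(200, 64));
}
```

For the 200-key space this gives a height of at most 6 at T = 2, comfortably inside the rough estimate above, and at most 1 at T = 64.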

db-10 — Execution

What was built, in the order it was built.

1. Rust (src/rust)

  • Cargo.toml declares crate btree10 (lib) and a binary btreectl. Edition 2021, lto = "thin", codegen-units = 1 for release.
  • src/lib.rs contains:
    • Constants T = 2, MAX_KEYS = 3, MIN_KEYS = 1.
    • Node { is_leaf, keys: Vec<(Vec<u8>, Vec<u8>)>, children: Vec<Box<Node>> }.
    • BTree with new, get, insert, delete, serialize_tree, scan, len, is_empty.
    • Free functions split_child, insert_nonfull, delete_from, plus the rebalance helpers borrow_from_prev, borrow_from_next, merge_children.
    • SplitMix64 PRNG (the textbook wrapping-add + xor-mul mix).
    • run_workload(scenario, seed, ops) -> BTree.
    • Inline #[cfg(test)] tests: empty-tree shape, single insert+get, insert + scan ordered, delete-of-absent returns false, delete-then-get returns None, deterministic shape under the three scenarios, scenario-cross seed independence.
  • src/bin/btreectl.rs: thin arg parser (--seed, --ops, --scenario), calls run_workload, writes serialize_tree() bytes to stdout.

2. Go (src/go)

  • go.mod module github.com/10xdev/dse/db10, Go 1.22.
  • btree.go ports the Rust API one-for-one. Pointer-based recursion: *node instead of Box<Node>. The serializer is byte-identical to Rust's: same preorder, same little-endian encodings.
  • btree_test.go mirrors all Rust tests.
  • cmd/btreectl/main.go is the matching CLI.

3. C++ (src/cpp)

  • CMakeLists.txt builds:
    • btree10_lib (static library from src/btree.cc).
    • btreectl (binary linking btree10_lib).
    • test_btree10 (ctest target linking btree10_lib).
    • Flags: -Wall -Wextra -Wpedantic -Werror -O3 -DNDEBUG in Release.
  • src/btree.h declares Node, BTree, run_workload, SplitMix64.
  • src/btree.cc implements them. std::unique_ptr<Node> plays the role of Rust's Box<Node>.
  • src/btreectl.cc is the CLI.
  • tests/test_btree10.cc mirrors Rust's inline tests. Uses #undef NDEBUG before <cassert> so asserts fire under Release; never assert(side_effect).

4. Scripts

  • scripts/verify.sh builds and runs unit tests in all three languages. Exits 0 only if all three are green; prints === OK ===.
  • scripts/cross_test.sh:
    1. Builds Rust/Go/C++ btreectl binaries.
    2. Scenario A: btreectl --seed 42 --ops 500 --scenario inserts in each language; sha256 + size comparison.
    3. Scenario B: btreectl --seed 7 --ops 500 --scenario mixed; sha256 + size comparison.
    4. Spot-check on the Rust scenario-A output: assert a known key-prefix appears in the hex stream, guarding against silent-empty-output regressions.
    5. Print === ALL OK ===.

What was deliberately not built

  • Persistence. No file I/O, no page format. db-11.
  • Range scans with iterator-style streaming. scan() returns the whole list; sufficient for tests, lazy for the spec.
  • Bulk-loading from a sorted input. A real B-tree would offer a fast path that builds the tree bottom-up. db-15 may revisit.
  • Concurrency control. No latches, no locks. Trees of T = 2 fit comfortably in a single thread's working set and the lab has no concurrent test harness.

db-10 — Observation

What the cross-language verification actually proves, and what the serialized stream looks like by hand.

Output of scripts/cross_test.sh

=== compare Scenario A (inserts seed=42 ops=500) ===
  A          rust=4b587ccce2627561c03d5db0c2c172642c9f3ed188c97fc53a215a3d0f316088 (    ???? B)
  A          go  =4b587ccce2627561c03d5db0c2c172642c9f3ed188c97fc53a215a3d0f316088 (    ???? B)
  A          cpp =4b587ccce2627561c03d5db0c2c172642c9f3ed188c97fc53a215a3d0f316088 (    ???? B)
  match(A): 4b587ccce2627561c03d5db0c2c172642c9f3ed188c97fc53a215a3d0f316088
=== compare Scenario B (mixed seed=7 ops=500) ===
  B          rust=9edbeec6436ee549c8a52b97f286831ed340c4bb588c6371542cdf0421e37718 (    2515 B)
  B          go  =9edbeec6436ee549c8a52b97f286831ed340c4bb588c6371542cdf0421e37718 (    2515 B)
  B          cpp =9edbeec6436ee549c8a52b97f286831ed340c4bb588c6371542cdf0421e37718 (    2515 B)
  match(B): 9edbeec6436ee549c8a52b97f286831ed340c4bb588c6371542cdf0421e37718
=== spot-check stream contents ===
  spot-checks ok
=== ALL OK ===

Reading the stream by hand

The empty tree is exactly five bytes:

01           is_leaf = 1
00 00 00 00  nkeys   = 0

After one insert (key="a", val="1"):

01                    is_leaf = 1
01 00 00 00           nkeys   = 1
01 00 00 00           klen    = 1
61                    key     = "a"
01 00 00 00           vlen    = 1
31                    val     = "1"

After the fourth distinct key, the root must split:

00                    is_leaf = 0                   ← became internal
01 00 00 00           nkeys   = 1
04 00 00 00           klen    = 4                   ← promoted middle key
… key bytes …
04 00 00 00           vlen    = 4
… val bytes …
01 …                  left child  (a leaf: is_leaf = 1, then nkeys and its entries)
01 …                  right child (a leaf, same layout)

The is_leaf byte changes from 01 to 00 precisely at the moment the root grows upwards. No other operation flips the root's byte in that direction; the reverse flip (00 back to 01) happens only when a delete collapses the root onto a leaf child.

What the matching sha256 proves

A single matching match(...) line proves that all three implementations agree on, at the byte level:

  1. The PRNG. Any drift in SplitMix64 would change the key stream, so the keys stored in the root node, near the very start of the serialized tree, would already differ.
  2. The lexicographic byte compare. Different ordering would re-route the descent at every internal node from key 4 onward.
  3. The proactive-split rule. Different split rules would produce different children counts and nkeys fields at every level above the leaves.
  4. The proactive-rebalance rule (Scenario B). The mixed scenario hits both insert and delete paths; the matching hash proves the borrow/merge logic agrees across all three.
  5. The preorder serializer with little-endian length prefixes. Different endianness or different node order would flip every single multi-byte field in the stream.

Any one of these going wrong, in any one of the three languages, makes the hashes diverge.

Sizes

Scenario B settles at exactly 2515 B for seed=7, ops=500, scenario=mixed. The Scenario A size varies but is also identical across all three languages (see the script output).

Spot-check rationale

The script greps the Rust scenario-A output for a known key prefix that must be inserted by SplitMix64(42)'s first few outputs. This guards against the silent-success regression where every language is "successfully" producing the same five-byte empty-tree header and nothing else.
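The script does this with grep over a hex dump; the same guard can be sketched in Rust for use inside a unit test. `spot_check` is a hypothetical name, and the needle used in `main` is an assumption derived from the workload's encoding: every key is an 8-byte big-endian value below 200, so each entry starts with the klen prefix 08 00 00 00 followed by seven zero bytes.

```rust
/// True when `stream` is more than the five-byte empty tree and contains
/// `needle` somewhere inside it.
fn spot_check(stream: &[u8], needle: &[u8]) -> bool {
    const EMPTY_TREE: [u8; 5] = [0x01, 0, 0, 0, 0];
    if stream == &EMPTY_TREE[..] || needle.is_empty() || needle.len() > stream.len() {
        return false;
    }
    stream.windows(needle.len()).any(|w| w == needle)
}

fn main() {
    // A one-entry leaf as the wire format would encode it (key = 42, 4-byte value).
    let mut stream = vec![1u8, 1, 0, 0, 0, 8, 0, 0, 0];
    stream.extend_from_slice(&42u64.to_be_bytes());
    stream.extend_from_slice(&[4, 0, 0, 0, 0xde, 0xad, 0xbe, 0xef]);

    let needle = [8u8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
    println!("populated stream passes: {}", spot_check(&stream, &needle));
    println!("empty tree passes: {}", spot_check(&[1, 0, 0, 0, 0], &needle));
}
```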

db-10 — Verification

Prerequisites

  • macOS or Linux with Apple Clang / clang ≥ 14 / gcc ≥ 11.
  • cmake ≥ 3.20.
  • Rust toolchain ≥ 1.74.
  • Go ≥ 1.22.
  • shasum, xxd, awk (default on macOS; coreutils on Linux).

One command

cd db-10-btree-fundamentals
scripts/verify.sh        # unit tests, all three languages
scripts/cross_test.sh    # cross-language sha256 match

verify.sh should end with === OK ===, cross_test.sh with === ALL OK ===, and both should exit 0.

Per-language drill-down

Rust

cd db-10-btree-fundamentals/src/rust
cargo test --quiet
cargo build --release

Expected: all inline tests pass. The btreectl binary lands in target/release/btreectl.

Go

cd db-10-btree-fundamentals/src/go
go test ./...
go build ./cmd/btreectl

Expected: ok github.com/10xdev/dse/db10 <duration>.

C++

cd db-10-btree-fundamentals/src/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
ctest --test-dir build --output-on-failure

Expected: 100% tests passed, 0 tests failed out of 1 and the test_btree10 target prints OK.

What "green" means

A green run guarantees:

  • All inline unit tests pass in Rust, Go, and C++.

  • The cross-language test produces byte-identical serialized trees for both canonical scenarios:

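    | scenario  | seed | ops | sha256                                                           |
    |-----------|------|-----|------------------------------------------------------------------|
    | A inserts | 42   | 500 | 4b587ccce2627561c03d5db0c2c172642c9f3ed188c97fc53a215a3d0f316088 |
    | B mixed   | 7    | 500 | 9edbeec6436ee549c8a52b97f286831ed340c4bb588c6371542cdf0421e37718 |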

    Matching sha256 across three independent implementations proves agreement on the PRNG, the lexicographic compare, the proactive-split insert, the proactive-rebalance delete, and the precise tree shape after the workload.

  • The spot-check confirms the stream is non-empty and contains an expected key prefix, guarding against the regression where all three languages "successfully" produce the same five-byte empty-tree header.

When verification fails

  • Cross-language sha256 mismatch at the very start of the stream: SplitMix64 divergence or a wrong is_leaf value on the root node.
  • Mismatch deep in the stream after matching headers — split or rebalance asymmetry; almost always a borrow-vs-merge decision that goes one way in two languages and the other in the third.
  • One language's scenario A matches but scenario B does not — a delete-path bug specific to that language. The inserts scenario never invokes delete, so it would not exercise the faulty path.
  • All three sha256s match each other but disagree with the baked-in expected hashes — a legitimate algorithm change. Make sure it was intentional, then update cross_test.sh and the table above in the same commit.
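A hash mismatch says that two streams differ but not where. When triaging, it helps to diff the raw outputs and find the first differing offset. A minimal sketch (the function names are illustrative, and the hint strings simply restate the list above):

```rust
/// Offset of the first differing byte, or None if the streams are identical.
/// If one stream is a strict prefix of the other, the mismatch is reported
/// at the length of the shorter stream.
fn first_mismatch(a: &[u8], b: &[u8]) -> Option<usize> {
    a.iter()
        .zip(b.iter())
        .position(|(x, y)| x != y)
        .or_else(|| {
            if a.len() != b.len() {
                Some(a.len().min(b.len()))
            } else {
                None
            }
        })
}

/// Maps the offset to the first suspects named in the list above.
fn triage(offset: Option<usize>) -> &'static str {
    match offset {
        None => "streams identical",
        Some(0) => "first byte differs: check the root's is_leaf value and the PRNG",
        Some(_) => "diverges later: suspect split or rebalance asymmetry",
    }
}

fn main() {
    let rust_out = [0u8, 1, 0, 0, 0, 8, 0, 0, 0];
    let go_out = [0u8, 1, 0, 0, 0, 8, 0, 0, 1];
    let at = first_mismatch(&rust_out, &go_out);
    println!("{:?}: {}", at, triage(at));
}
```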

db-10 — Broader Ideas

Where this in-memory B-tree fits in the rest of the track, and which real-world techniques live one or two steps beyond it.

Immediate next labs

  • db-11 — Pager system. Wraps each node in a fixed-size disk page. Trades the heap-allocated Box<Node> recursion for a PageId-indexed page cache plus a free-list. Introduces the B+-tree layout (values only in leaves; leaves doubly linked for range scans) because internal nodes must fit one to a page.
  • db-12 — SQL frontend. Parses a small SQL subset (CREATE TABLE, INSERT, SELECT, UPDATE, DELETE), plans it into a B+-tree-backed table, and exposes a REPL.
  • db-13 — Transactions and MVCC. Versioned B+-tree pages so readers do not block writers. Snapshots are root-page references at a given commit timestamp.
  • db-14 — Indexes and query optimization. Secondary B-trees whose keys are (indexed_column, primary_key) pairs. Plans index scans, index-only scans, and merge joins.
  • db-15 — SQLite-complete. Everything above stitched into one executable; the B-tree track's counterpart to db-09.

How this lab's pieces show up in real systems

  • T = 2 "demo size" B-trees are exactly what every textbook uses as a teaching aid, including the one most engineers learn on. Production engines use T chosen to fit a 4 KiB / 8 KiB page, but the algorithms are unchanged.
  • Proactive split / rebalance is the standard discipline; the alternative (descend, then walk back up to fix overflows) is textbook for binary search trees but rare in B-trees because it makes concurrency control much harder.
  • Preorder canonical serialization plays the same role as the dump tooling production engines ship for inspecting on-disk state, such as RocksDB's sst_dump for SSTables. Every storage engineer needs some byte-exact dump format; here we picked the simplest one that captures structure.
  • SplitMix64 is the generator behind Java 8's SplittableRandom and a common choice for seeding other PRNGs such as the xoshiro family; its output function is a well-studied 64-bit mixer. Using it for the workload generator means the keys we touch are spread evenly over the key space, not pathologically biased.

Performance experiments worth running later

  • Plot len() vs serialized size to see the per-key overhead at T = 2. Compare to T = 64 (db-11) to see how internal-node shrinkage from B+-tree leaves changes the breakdown.
  • Sweep KEY_SPACE from 100 up to 100 000 and watch the insert-delete-insert workload's steady-state size oscillate.
  • Replace the recursive serialize_tree with an explicit-stack iterative version and measure the wall-time gap. Useful prep before db-22.

What "production-quality" would require beyond this lab

  • Variable-length keys with prefix compression on the page.
  • Page-level checksums and a magic byte at offset 0 so a corrupted read fails loudly instead of returning random keys.
  • Free-list management for reclaimed pages after deletes (db-11).
  • Concurrent insert/delete protocols: latch coupling, optimistic lock coupling, or Lehman and Yao's "B-link tree" right-link technique for non-blocking traversal during a split.
  • Copy-on-write pages so readers see a consistent snapshot during writes (LMDB-style).
  • A persistent WAL (write-ahead log) record per page mutation so the tree can be replayed on recovery (db-03 / db-11).

None of these change the shape of the in-memory algorithms; they add policies on top of the same proactive-split / proactive-rebalance kernel built here.

db-10 step 01 — Tree shape and get / scan

Goal

Define the in-memory B-tree's node representation and implement the two read-only operations: point lookup get(k) and ordered scan() -> Vec<(k,v)>. No mutation yet; this step is about pinning the data structure.

Tasks

  1. Declare constants T = 2, MIN_KEYS = T - 1, MAX_KEYS = 2T - 1.
  2. Define a Node containing:
    • is_leaf: bool
    • keys: Vec<(Vec<u8>, Vec<u8>)> — sorted by key
    • children: Vec<Box<Node>> — empty for leaves; for internal nodes, children.len() == keys.len() + 1
  3. Define BTree { root: Box<Node> }. new() produces an empty leaf root.
  4. Implement get(&self, key: &[u8]) -> Option<Vec<u8>> by descending from the root: at each node, find the first key >= target; if equal, return its value; if leaf, return None; else recurse into children[i].
  5. Implement scan(&self) -> Vec<(Vec<u8>, Vec<u8>)> as the standard in-order traversal: for each i in 0..keys.len(), recurse into children[i] (internal nodes only), then push keys[i]; finally, for internal nodes, recurse into the last child.
  6. Implement len() and is_empty() as helpers.
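One possible shape for this step is sketched below. It is not the reference solution: `get` is written as a loop rather than a recursion, `len` simply counts the scan, and `sample_tree` and `kv` are hypothetical helpers that build a two-level tree by hand so the read paths can be exercised before insert exists.

```rust
struct Node {
    is_leaf: bool,
    keys: Vec<(Vec<u8>, Vec<u8>)>, // sorted by key
    children: Vec<Box<Node>>,      // empty for leaves, keys.len() + 1 otherwise
}

struct BTree {
    root: Box<Node>,
}

fn scan_into(node: &Node, out: &mut Vec<(Vec<u8>, Vec<u8>)>) {
    for i in 0..node.keys.len() {
        if !node.is_leaf {
            scan_into(&node.children[i], out);
        }
        out.push(node.keys[i].clone());
    }
    if !node.is_leaf {
        scan_into(&node.children[node.keys.len()], out);
    }
}

impl BTree {
    fn new() -> Self {
        BTree { root: Box::new(Node { is_leaf: true, keys: Vec::new(), children: Vec::new() }) }
    }

    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        let mut node: &Node = &*self.root;
        loop {
            // First index whose key is >= the target (linear scan).
            let i = node
                .keys
                .iter()
                .position(|(k, _)| k.as_slice() >= key)
                .unwrap_or(node.keys.len());
            if i < node.keys.len() && node.keys[i].0.as_slice() == key {
                return Some(node.keys[i].1.clone());
            }
            if node.is_leaf {
                return None;
            }
            node = &*node.children[i];
        }
    }

    fn scan(&self) -> Vec<(Vec<u8>, Vec<u8>)> {
        let mut out = Vec::new();
        scan_into(&self.root, &mut out);
        out
    }

    fn len(&self) -> usize {
        self.scan().len()
    }

    fn is_empty(&self) -> bool {
        self.root.keys.is_empty()
    }
}

fn kv(k: &str, v: &str) -> (Vec<u8>, Vec<u8>) {
    (k.as_bytes().to_vec(), v.as_bytes().to_vec())
}

/// Hand-built tree: root [m] over leaves [a, c] and [x].
fn sample_tree() -> BTree {
    let leaf = |keys| Box::new(Node { is_leaf: true, keys, children: Vec::new() });
    let left = leaf(vec![kv("a", "1"), kv("c", "9")]);
    let right = leaf(vec![kv("x", "3")]);
    BTree { root: Box::new(Node { is_leaf: false, keys: vec![kv("m", "2")], children: vec![left, right] }) }
}

fn main() {
    let t = sample_tree();
    println!("get(c) = {:?}", t.get(b"c"));
    println!("len = {}, empty = {}", t.len(), t.is_empty());
}
```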

Acceptance

Inline unit tests:

  • get_on_empty_returns_none: BTree::new().get(b"k") == None.
  • manual_build_get_returns_value — manually construct a 3-key leaf, get returns the right value for each key and None for misses.
  • scan_is_sorted — manually construct an internal node with two leaf children; scan() returns the merged sorted sequence.

All three green in Rust, Go, and C++.

Discussion prompts

  • Why does get use linear scan over keys rather than binary search? For T = 2 the answer is obvious; for T = 256 is it still?
  • Why is is_leaf stored on each node rather than inferred from children.is_empty()?
  • What goes wrong if scan recurses into the last child before pushing the last key?

db-10 step 02 — Insert and delete (split, borrow, merge)

Goal

Implement mutation: insert(k, v) and delete(k) -> bool. Both operations must preserve the height invariant — every leaf at the same depth, every node within [MIN_KEYS, MAX_KEYS] (except the root).

Tasks

  1. Insert.
    • If root.keys.len() == MAX_KEYS, grow up: wrap the old root in a new internal root with one child, then split_child(new_root, 0). This is the only place height ever increases.
    • Then insert_nonfull(root, k, v).
  2. insert_nonfull(node, k, v).
    • If node.is_leaf: splice the entry into the sorted slot. If the key already exists, overwrite the value in place.
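    • Else: find the smallest i such that key <= node.keys[i].0, or i = keys.len() if there is none; children[i] is the child whose range covers the key. Because values live in every node, a key equal to node.keys[i].0 is overwritten in place here as well. Otherwise, if children[i] is full, split first (pre-emptive split), then compare against the newly promoted node.keys[i].0: overwrite if equal, advance i if the key is greater. Recurse into children[i].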
  3. split_child(parent, i). Precondition: parent.children[i].keys.len() == MAX_KEYS == 3. Effect:
    • Promote the middle key (index 1) into parent.keys[i].
    • Move the right half (key 2 plus children 2..=3) into a new sibling at parent.children[i + 1].
    • The left half (key 0 plus children 0..=1) remains in parent.children[i].
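  4. Delete. Recursive delete_from(node, k) -> bool maintains the invariant that every node we descend into has ≥ T keys (the root excepted), so removing a key never leaves it underfull. Three cases, depending on where the key is found:
    • Key in this leaf → splice out, return true.
    • Key in this internal node → replace with the in-order predecessor or successor (drawn from whichever neighbor child has ≥ T keys), then recursively delete that pred/succ. If neither neighbor child has ≥ T keys, merge the two children around the key and delete the key from the merged child.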
    • Key not in this subtree (descending into children[i]):
      • If children[i].keys.len() < T, borrow from children[i-1] or children[i+1] if one of them has > MIN_KEYS. Prefer left. Otherwise merge children[i] with its left or right sibling (prefer right if it exists, else left), pulling the separating key from the parent.
      • Recurse into children[i] (which is now safe).
  5. After delete_from returns, if root became a keyless internal node, collapse: root = root.children.remove(0). This is the only place height ever decreases.
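The insert half of these tasks fits in a short sketch. It follows the proactive-split rules above under one stated assumption: an existing key is overwritten wherever it is found, including in an internal node, which is what keeps len() at 1 in the overwrite test. Delete is omitted for length, and `count` is a hypothetical helper used only to check the result.

```rust
const T: usize = 2;
const MAX_KEYS: usize = 2 * T - 1;

struct Node {
    is_leaf: bool,
    keys: Vec<(Vec<u8>, Vec<u8>)>,
    children: Vec<Box<Node>>,
}

struct BTree {
    root: Box<Node>,
}

/// Split the full child at index `i`, promoting its middle key into `parent`.
fn split_child(parent: &mut Node, i: usize) {
    let child = &mut parent.children[i];
    assert_eq!(child.keys.len(), MAX_KEYS, "split_child needs a full child");
    let right_keys = child.keys.split_off(T); // keys T.. move right
    let middle = child.keys.pop().expect("full child has a middle key");
    let right_children = if child.is_leaf { Vec::new() } else { child.children.split_off(T) };
    let right = Box::new(Node { is_leaf: child.is_leaf, keys: right_keys, children: right_children });
    parent.keys.insert(i, middle);
    parent.children.insert(i + 1, right);
}

/// Precondition: `node` is not full.
fn insert_nonfull(node: &mut Node, key: &[u8], val: &[u8]) {
    let mut i = node
        .keys
        .iter()
        .position(|(k, _)| k.as_slice() >= key)
        .unwrap_or(node.keys.len());
    if i < node.keys.len() && node.keys[i].0.as_slice() == key {
        node.keys[i].1 = val.to_vec(); // overwrite in place (assumption: at any level)
        return;
    }
    if node.is_leaf {
        node.keys.insert(i, (key.to_vec(), val.to_vec()));
        return;
    }
    if node.children[i].keys.len() == MAX_KEYS {
        split_child(node, i);
        // The promoted key now sits at keys[i]; decide again which side to take.
        if node.keys[i].0.as_slice() == key {
            node.keys[i].1 = val.to_vec();
            return;
        }
        if key > node.keys[i].0.as_slice() {
            i += 1;
        }
    }
    insert_nonfull(&mut node.children[i], key, val);
}

impl BTree {
    fn new() -> Self {
        BTree { root: Box::new(Node { is_leaf: true, keys: Vec::new(), children: Vec::new() }) }
    }

    fn insert(&mut self, key: &[u8], val: &[u8]) {
        if self.root.keys.len() == MAX_KEYS {
            // Grow up: the old root becomes the only child of a new internal root.
            let new_root = Box::new(Node { is_leaf: false, keys: Vec::new(), children: Vec::new() });
            let old_root = std::mem::replace(&mut self.root, new_root);
            self.root.children.push(old_root);
            split_child(&mut self.root, 0);
        }
        insert_nonfull(&mut self.root, key, val);
    }
}

/// Hypothetical helper: total number of entries in the subtree.
fn count(node: &Node) -> usize {
    node.keys.len() + node.children.iter().map(|c| count(c)).sum::<usize>()
}

fn main() {
    let mut t = BTree::new();
    for k in 1u8..=4 {
        t.insert(&[k], b"v");
    }
    println!("root is leaf: {}, root keys: {}, entries: {}", t.root.is_leaf, t.root.keys.len(), count(&t.root));
}
```

With T = 2 the fourth insert finds a full three-key root, so the grow-up path runs before the key is placed.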

Acceptance

Inline unit tests:

  • insert_then_get_roundtrip — insert 50 keys, all of them retrievable.
  • insert_overwrites — inserting ("k", "v1") then ("k", "v2") yields get("k") == "v2" and len() == 1.
  • delete_existing_returns_true — insert "k", delete "k" returns true, get("k") returns None.
  • delete_missing_returns_false: BTree::new().delete(b"k") is false.
  • inserts_grow_tree — insert enough keys to force at least one grow-up; check len() matches insertions.
  • deletes_shrink_tree — insert N keys then delete them all; len() goes to 0, tree is still well-formed (collapsed root).

All six green in Rust, Go, and C++.

Discussion prompts

  • Why is pre-emptive split preferred over "descend, recurse, split on the way back"?
  • For deletion, why must we ensure children[i].keys.len() >= T before descending, not after?
  • What's the tie-break rule when both siblings have spare keys — borrow from left or right? What's the cost of getting it wrong?
  • How would copy-on-write change split_child and delete_from?

db-10 step 03 — Serialize + CLI + cross-language byte-identity

Goal

Define a canonical wire format for the tree, build a btreectl CLI that runs a deterministic workload and writes the serialized tree to stdout, then prove via sha256 that all three implementations produce identical bytes for two distinct scenarios.

Wire format

Preorder traversal. Per node, in this exact order:

u8        is_leaf                  (1 = leaf, 0 = internal)
u32 LE    nkeys
nkeys *   { u32 LE klen, key bytes, u32 LE vlen, val bytes }
if !is_leaf:
    (nkeys + 1) * recurse(child_j)

All length prefixes are little-endian (matches every other lab in the workspace). The empty tree serializes as 01 00 00 00 00 (one empty leaf).
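Because the format is self-describing, a reader can recover the tree's structure from the bytes alone. The sketch below walks a stream and reports node count, key count, and height, rejecting truncated input, trailing bytes, and leaves at unequal depth. `Stats`, `parse_node`, and `parse_tree` are illustrative names, not part of the lab's API.

```rust
/// Structure recovered from a serialized tree.
#[derive(Debug, PartialEq)]
struct Stats {
    nodes: usize,
    keys: usize,
    height: usize, // 0 for a lone leaf
}

fn read_u32(buf: &[u8], pos: &mut usize) -> Option<u32> {
    let b = buf.get(*pos..*pos + 4)?;
    *pos += 4;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Parses one node and its subtree starting at `*pos`.
fn parse_node(buf: &[u8], pos: &mut usize) -> Option<Stats> {
    let is_leaf = match *buf.get(*pos)? {
        1 => true,
        0 => false,
        _ => return None,
    };
    *pos += 1;
    let nkeys = read_u32(buf, pos)? as usize;
    for _ in 0..nkeys {
        // Each entry is a length-prefixed key followed by a length-prefixed value.
        for _ in 0..2 {
            let len = read_u32(buf, pos)? as usize;
            buf.get(*pos..*pos + len)?; // bounds check only
            *pos += len;
        }
    }
    let mut stats = Stats { nodes: 1, keys: nkeys, height: 0 };
    if !is_leaf {
        let mut child_height: Option<usize> = None;
        for _ in 0..=nkeys {
            let child = parse_node(buf, pos)?;
            if *child_height.get_or_insert(child.height) != child.height {
                return None; // uniform-depth invariant violated
            }
            stats.nodes += child.nodes;
            stats.keys += child.keys;
        }
        stats.height = child_height? + 1;
    }
    Some(stats)
}

/// Parses a whole stream; trailing bytes are an error.
fn parse_tree(buf: &[u8]) -> Option<Stats> {
    let mut pos = 0;
    let stats = parse_node(buf, &mut pos)?;
    if pos == buf.len() {
        Some(stats)
    } else {
        None
    }
}

fn main() {
    println!("{:?}", parse_tree(&[1, 0, 0, 0, 0]));
}
```

A walker like this is also a convenient debugging aid when two implementations disagree: it turns a byte offset into a position in the tree.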

Deterministic workload

KEY_SPACE = 200

run_workload(scenario, seed, ops):
    rng = SplitMix64(seed)
    tree = BTree::new()
    for i in 0..ops:
        r1 = rng.next_u64()
        r2 = rng.next_u64()                 # ALWAYS draw both
        key = (r1 % KEY_SPACE).to_be_bytes()  # u64 BE = 8 bytes
        val = (r2 as u32).to_be_bytes()       # u32 BE = 4 bytes
        match scenario:
            "inserts" : tree.insert(&key, &val)
            "deletes" : if i < ops/2: tree.insert(&key, &val) else: tree.delete(&key)
            "mixed"   : op = (r1 >> 62) & 0x3
                        0 | 1 -> insert ; 2 -> delete ; 3 -> skip
    return tree

Two PRNG draws per iteration is non-negotiable; if any implementation short-circuits the second draw on a skip branch, the seed → state mapping desyncs.
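The per-iteration decoding is a pure function of the two draws, which makes it easy to pin down in isolation. In this sketch `decode_step` and `Op` are hypothetical names; the arithmetic mirrors the spec above, including the big-endian key encoding that makes lexicographic byte order agree with numeric order.

```rust
const KEY_SPACE: u64 = 200;

#[derive(Debug, PartialEq, Clone, Copy)]
enum Op {
    Insert,
    Delete,
    Skip,
}

/// Key, value, and "mixed"-scenario operation derived from one iteration's draws.
fn decode_step(r1: u64, r2: u64) -> ([u8; 8], [u8; 4], Op) {
    let key = (r1 % KEY_SPACE).to_be_bytes(); // u64 big-endian, 8 bytes
    let val = (r2 as u32).to_be_bytes(); // low 32 bits of r2, big-endian
    let op = match (r1 >> 62) & 0x3 {
        0 | 1 => Op::Insert,
        2 => Op::Delete,
        _ => Op::Skip,
    };
    (key, val, op)
}

fn main() {
    let (key, val, op) = decode_step(1u64 << 63, 0x1_0000_0001);
    println!("key = {:?}, val = {:?}, op = {:?}", key, val, op);
}
```

The top two bits of r1 select the operation while the key uses r1 modulo KEY_SPACE, so the key never depends on which scenario is running.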

CLI contract

btreectl --seed N --ops M --scenario {inserts | deletes | mixed}

Writes the canonical wire-format bytes (no trailing newline) to stdout.

Tasks

  1. Add serialize_tree(&self) -> Vec<u8> to BTree. Pure function; does not mutate the tree.
  2. Implement the SplitMix64 PRNG with the standard constants (0x9E3779B97F4A7C15, 0xBF58476D1CE4E7B5, 0x94D049BB133111EB).
  3. Implement run_workload per the spec above.
  4. Implement btreectl in Rust, Go, and C++.
  5. Write scripts/verify.sh that runs unit tests in all three langs.
  6. Write scripts/cross_test.sh that:
    1. Builds all three binaries.
    2. Scenario A: btreectl --seed 42 --ops 500 --scenario inserts → sha256 all three → assert match. Hash: 4b587ccce2627561c03d5db0c2c172642c9f3ed188c97fc53a215a3d0f316088.
    3. Scenario B: btreectl --seed 7 --ops 500 --scenario mixed → sha256 all three → assert match. Hash: 9edbeec6436ee549c8a52b97f286831ed340c4bb588c6371542cdf0421e37718.
    4. Spot-check that the stream contains an expected byte sequence (defensive against silent-empty regressions).
    5. Print === ALL OK ===.

Acceptance

$ scripts/verify.sh
=== rust === ... ok
=== go   === ... ok
=== cpp  === ... ok
=== OK ===

$ scripts/cross_test.sh
...
  match(A): 4b587ccce2627561c03d5db0c2c172642c9f3ed188c97fc53a215a3d0f316088
  match(B): 9edbeec6436ee549c8a52b97f286831ed340c4bb588c6371542cdf0421e37718
=== ALL OK ===

A byte-identical hash across three independent implementations for both scenarios is a near-proof that the PRNG, key/value encoding, insert path, delete path, and serialization format are all spec-compliant.

Discussion prompts

  • Why must we draw two PRNG outputs per iteration even when the scenario chooses to skip?
  • Why is the wire format preorder rather than level-order or in-order? What property does preorder preserve that the others lose?
  • If the Scenario-A hash matches but Scenario-B doesn't, what code paths are the prime suspects, and why?
  • The sha256s are baked into cross_test.sh as constants. What is the benefit, and what is the maintenance cost when the wire format legitimately evolves?

db-11 — Pager System

The first lab of the B-tree track where bytes leave RAM. db-10 built a B-tree out of Box<Node>s and proved three languages agreed on shape; this lab builds the substrate that turns those shapes into durable files. Every disk-backed engine in the series from here on — SQLite (db-15), MVCC (db-13), the distributed KV store (db-20), and the capstone (db-23) — sits on top of a pager. This is the component most production databases share.

What is it?

A pager is the layer that:

  1. Carves a file into fixed-size pages (we use 4 KiB by default; tests run with 256 B to keep dumps readable).
  2. Hands out pages by 1-based page id; page 0 is reserved for a file header that nails down the format.
  3. Maintains an in-memory page cache of bounded capacity, evicts with LRU, and writes dirty pages back to disk on eviction and on explicit flush().
  4. Calls fsync exactly when the user asks for durability, never on every write.

The interface is intentionally minimal:

open(path, page_size, cache_capacity) -> Pager
Pager::allocate() -> PageId            // grow file by one page
Pager::read(pid)  -> Vec<u8>           // page_size bytes
Pager::write(pid, bytes)               // bytes.len() == page_size
Pager::flush()                         // write all dirty + fsync
Pager::close()

No B-tree nodes, no records, no keys. The B+-tree in db-15 will encode those structures into the page bytes; the pager neither knows nor cares what the bytes mean.

Why does it matter?

  • The cache is the database. Every production engine spends most of its time hitting a buffer pool, not reading disk. The LRU policy, the dirty bit, and the eviction discipline are the difference between "fits in RAM, fast" and "thrashes, dead".
  • The file layout is a binding contract. Once two implementations agree on byte 0 of every page, the database is portable across languages and platforms. db-15 will reuse this contract; the cross-language hash test in this lab proves it holds before the B+-tree code ever runs.
  • fsync is the only thing that buys durability. Every other write is just a hint to the OS. Knowing exactly when fsync runs (and why) is what separates working systems from data-loss outages.

How does it work?

File layout

offset 0                            offset = N * page_size
┌─────────────────────────┬─────────────┬─────────────┬─────┐
│  page 0 (header)        │   page 1    │   page 2    │ ... │
│  magic | psz | npages   │ user bytes  │ user bytes  │     │
│  + zero-pad to page_size│             │             │     │
└─────────────────────────┴─────────────┴─────────────┴─────┘

Page 0 is 24 bytes of header + zero-padding:

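| offset | size | field     | value                                 |
|--------|------|-----------|---------------------------------------|
| 0      | 16   | magic     | "DSE-PAGER-v1\0\0\0\0" (ASCII + NULs) |
| 16     | 4    | page_size | u32 little-endian                     |
| 20     | 4    | num_pages | u32 little-endian (includes page 0)   |
| 24     | rest | zeros     | padding to page_size                  |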

num_pages is the durable page count — what the file claims after fsync. The in-memory pager may have allocated pages beyond that which have not been flushed yet; close()/flush() reconcile them.
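The header layout translates directly into a pair of encode and decode functions. This is a sketch under the layout in the table above; `encode_header` and `decode_header` are illustrative names, and the error strings are placeholders.

```rust
const MAGIC: &[u8; 16] = b"DSE-PAGER-v1\0\0\0\0";
const HEADER_LEN: usize = 24;

/// Builds page 0: 24 header bytes followed by zero padding up to `page_size`.
fn encode_header(page_size: u32, num_pages: u32) -> Vec<u8> {
    assert!(page_size as usize >= HEADER_LEN, "page too small for the header");
    let mut page = vec![0u8; page_size as usize];
    page[0..16].copy_from_slice(MAGIC);
    page[16..20].copy_from_slice(&page_size.to_le_bytes());
    page[20..24].copy_from_slice(&num_pages.to_le_bytes());
    page
}

/// Returns (page_size, num_pages), or an error if the magic does not match.
fn decode_header(page: &[u8]) -> Result<(u32, u32), &'static str> {
    if page.len() < HEADER_LEN {
        return Err("header truncated");
    }
    if page[..16] != MAGIC[..] {
        return Err("bad magic");
    }
    let page_size = u32::from_le_bytes([page[16], page[17], page[18], page[19]]);
    let num_pages = u32::from_le_bytes([page[20], page[21], page[22], page[23]]);
    Ok((page_size, num_pages))
}

fn main() {
    // A fresh file with 256-byte pages holds only the header page.
    let page0 = encode_header(256, 1);
    println!("{} bytes, decoded = {:?}", page0.len(), decode_header(&page0));
}
```

Checking the magic on open is what lets a reader fail loudly on a file that was written with a different format or is not a pager file at all.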

Cache, in pictures

        cache_capacity = 3,   recent = [pid=5] [pid=2] [pid=7]
                                MRU             LRU

  read(5)   →   hit, promote 5 to head        [5] [2] [7]
  write(9)  →   miss, evict 7 (writeback)     [9*] [5] [2]      ← 9 dirty
  read(2)   →   hit, promote 2                [2] [9*] [5]
  flush()   →   write 9, fsync                [2] [9 ] [5]

Each frame in the cache carries:

  • pid: u32
  • data: Vec<u8> of length page_size
  • dirty: bool
  • linked-list pointers (prev / next) into the LRU chain

The lookup table is a hashmap pid → frame_index (Rust) or pid → *list.Element (Go) or pid → list iterator (C++). All three give O(1) lookup; promotion to MRU is O(1) doubly-linked-list splice.
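The bookkeeping in the picture can be sketched without any file I/O. Note the deliberate simplification: this version keeps the recency order in a `VecDeque`, so promotion is O(n) rather than the O(1) list splice described above. `Cache`, `put`, and `get` are illustrative names, and write-back of an evicted dirty frame is left to the caller.

```rust
use std::collections::{HashMap, VecDeque};

struct Frame {
    data: Vec<u8>,
    dirty: bool,
}

/// LRU bookkeeping only. `order` holds pids with the MRU entry at the front.
struct Cache {
    capacity: usize,
    frames: HashMap<u32, Frame>,
    order: VecDeque<u32>,
}

impl Cache {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0);
        Cache { capacity, frames: HashMap::new(), order: VecDeque::new() }
    }

    /// Moves `pid` to the MRU position (O(n) in this sketch).
    fn promote(&mut self, pid: u32) {
        if let Some(i) = self.order.iter().position(|&p| p == pid) {
            self.order.remove(i);
        }
        self.order.push_front(pid);
    }

    /// Inserts or overwrites a frame. Returns the evicted (pid, frame), if any,
    /// so the caller can write it back when it is dirty.
    fn put(&mut self, pid: u32, data: Vec<u8>, dirty: bool) -> Option<(u32, Frame)> {
        let mut evicted = None;
        if !self.frames.contains_key(&pid) && self.frames.len() == self.capacity {
            let victim = self.order.pop_back().expect("a full cache has a tail");
            evicted = self.frames.remove(&victim).map(|f| (victim, f));
        }
        self.frames.insert(pid, Frame { data, dirty });
        self.promote(pid);
        evicted
    }

    /// Cache hit: promote and return a copy of the page. None on a miss.
    fn get(&mut self, pid: u32) -> Option<Vec<u8>> {
        let data = self.frames.get(&pid)?.data.clone();
        self.promote(pid);
        Some(data)
    }
}

fn main() {
    let mut cache = Cache::new(3);
    for pid in [7u32, 2, 5] {
        cache.put(pid, vec![0; 4], false); // warm up to the picture's start state
    }
    cache.get(5);
    let evicted = cache.put(9, vec![1; 4], true); // write(9): miss, evicts the tail
    cache.get(2);
    println!("evicted pid: {:?}", evicted.map(|(pid, _)| pid));
    println!("order (MRU first): {:?}", cache.order);
}
```

Replaying the four steps from the picture leaves the order at 2, 9, 5 with page 9 still dirty until the next flush.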

Read path

read(pid):
    if pid == 0 or pid >= num_pages_in_memory: panic
    if pid in cache:
        promote cache[pid] to MRU
        cache_hits += 1
        return clone of cache[pid].data
    else:
        cache_misses += 1
        if cache is full:
            evict tail; if dirty, pwrite then mark clean
        buffer = pread(page_size bytes at offset pid * page_size)
        insert (pid, buffer, dirty=false) at MRU
        return clone

Write path

write(pid, bytes):
    assert bytes.len() == page_size
    if pid in cache:
        cache[pid].data = bytes
        cache[pid].dirty = true
        promote to MRU
    else:
        if cache is full: evict tail with write-back as above
        insert (pid, bytes, dirty=true) at MRU       ← no read!

The "write-without-read" path is the optimization that makes bulk loads cheap. A B+-tree splitting a leaf allocates a fresh page and writes the whole 4 KiB at once; reading the old (uninitialized) contents first would double I/O for nothing.

Allocate

allocate():
    pid = num_pages_in_memory
    num_pages_in_memory += 1
    return pid                       (1-based; pid 0 is the header)

The file is extended lazily — only when the page is actually written back (either via eviction or flush). This means a sequence of allocate(); allocate(); allocate() without writes never touches disk, which matters for transactions that roll back.

Flush

flush():
    rewrite page 0 with current num_pages
    for each cached page in ascending pid order:
        if dirty: pwrite at offset pid*page_size; mark clean
    fsync

Sorting by pid before writing turns N scattered seeks into one sequential pass on a spinning disk. On SSDs the win is smaller but still real: ascending offsets let the kernel and the device merge adjacent writes into fewer, larger requests.
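A minimal sketch of the sorted write-out, assuming a Unix target (it uses positional writes from std::os::unix::fs::FileExt). The DirtyFrame type and the flush_sorted function are hypothetical stand-ins for the lab's frame and flush code, and the header rewrite of page 0 is left out to keep the example short.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt; // write_all_at: pwrite-style positional write (Unix only)

/// Hypothetical stand-in for a cached frame.
pub struct DirtyFrame {
    pub pid: u32,
    pub data: Vec<u8>,
    pub dirty: bool,
}

/// Write every dirty frame in ascending pid order, mark it clean, then fsync once.
/// Returns the pids in the order they were written.
pub fn flush_sorted(
    file: &File,
    frames: &mut [DirtyFrame],
    page_size: usize,
) -> io::Result<Vec<u32>> {
    // Collect only the dirty frames; clean ones never touch the disk.
    let mut dirty: Vec<&mut DirtyFrame> = frames.iter_mut().filter(|f| f.dirty).collect();
    // Ascending pid means ascending file offset.
    dirty.sort_by_key(|f| f.pid);

    let mut written = Vec::with_capacity(dirty.len());
    for f in dirty {
        file.write_all_at(&f.data, f.pid as u64 * page_size as u64)?;
        f.dirty = false;
        written.push(f.pid);
    }
    file.sync_all()?; // one fsync for the whole batch
    Ok(written)
}
```

Sorting a collected vector rather than iterating the cache map is also what keeps the write order independent of hash-map iteration order.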

Determinism

The lab's verification depends on every operation being deterministic given the seed, the workload, and the cache capacity. Two things that look like they could leak nondeterminism but do not:

  • HashMap iteration order. We never iterate the cache map; the flush loop sorts dirty frames by pid first.
  • fsync timing. fsync does not change the byte contents of the file, only their visibility after a crash. The sha256 we compare is taken from the post-flush file, which is fully determined.

Where this fits

  • Upstream: none directly; the pager is a from-scratch component.
  • Downstream: db-12 (SQL frontend storage), db-13 (MVCC over snapshot page versions), db-14 (secondary index B+-trees over the pager), db-15 (SQLite-complete), db-21 (advanced storage variants), and every distributed lab from db-16 onwards stores state on a pager-backed file.

db-11 — References

Primary sources

  • SQLite Pager design notes — the cleanest public description of a production pager, including how it interacts with rollback journals and WAL. The architecture of the db-11 pager is a deliberate simplification of this design. https://www.sqlite.org/atomiccommit.html https://www.sqlite.org/walformat.html
  • LMDB / mdb design — Howard Chu, MDB: A Memory-Mapped Database and Backend for OpenLDAP. Describes a B+-tree pager whose write path is copy-on-write rather than write-back. Useful counterpoint to the LRU + dirty-bit approach we took. https://www.symas.com/symas-embedded-database-lmdb
  • Goetz Graefe, Modern B-Tree Techniques, Foundations and Trends in Databases 3(4), 2010. Chapter 2 covers buffer-pool management and the page-eviction policies real engines use.

Operating-systems background

  • Andrew S. Tanenbaum, Modern Operating Systems, 4th ed., chapter on file systems and page caches. The OS's own page cache is conceptually our cache; understanding pread/pwrite/fsync at the kernel level explains why "writing" without fsync is not durable.
  • fsync(2) man page — the canonical answer to "what does fsync actually guarantee?" Read this before assuming a write reached disk.
  • Eduardo Pinheiro et al., Failure Trends in a Large Disk Drive Population, FAST 2007. Sobering reminder that the device under the pager does fail; durability is a probabilistic claim.

Replacement policies

  • Elizabeth O'Neil, Patrick O'Neil, Gerhard Weikum, The LRU-K Page Replacement Algorithm For Database Disk Buffering, SIGMOD 1993. Why naive LRU thrashes on scan-heavy workloads, and the fix everyone borrowed.
  • Theodore Johnson, Dennis Shasha, 2Q: A Low Overhead High Performance Buffer Management Replacement Algorithm, VLDB 1994. PostgreSQL briefly used a 2Q variant (in its 8.0.x releases) before switching to clock sweep.
  • The db-11 implementation deliberately uses textbook LRU. db-22 (performance) will measure when this hurts and what 2Q / CLOCK / ARC buy.

Production engines whose pager you can read

Cross-lab dependencies

  • Upstream: none. The pager is a from-scratch component.
  • Downstream: db-12 (SQL frontend), db-13 (MVCC), db-14 (indexes), db-15 (SQLite-complete), db-20 (distributed KV) all store state on top of a pager file in this format.

db-11 — Analysis

What had to be decided before any code was written, and why each choice locks in trade-offs the rest of the B-tree track will pay for or be paid by.

Required invariants

  1. File layout is canonical. Bytes 0..15 of page 0 are the magic string DSE-PAGER-v1\0\0\0\0; bytes 16..19 are page_size LE; bytes 20..23 are num_pages LE; bytes 24..page_size-1 are zero. Any pager implementation that produces or consumes a file must agree on these bytes to the bit.
  2. Cache capacity is hard. After every operation, the number of resident frames is <= cache_capacity. The eviction path maintains this invariant before admitting a new frame, never after.
  3. Dirty pages survive eviction. If a page is evicted while dirty == true, its bytes are written to disk before the frame is reused. The cache may evict at any time; a dropped dirty page is a data-loss bug.
  4. Determinism. Given (path, page_size, cache_capacity, seed, ops, scenario), the post-flush file bytes are a pure function of those inputs. Two languages running the same workload must produce sha256-identical files.
  5. Page 0 is reserved. User code receives only pid >= 1 from allocate(). read(0) / write(0) is undefined behaviour (panic in Rust; documented but unenforced in Go/C++).
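Invariant 1 can be written down directly. The sketch below builds page 0 from the layout given above; the function names encode_header and hex are illustrative, not the lab's API.

```rust
pub const MAGIC: &[u8; 16] = b"DSE-PAGER-v1\0\0\0\0";
pub const HEADER_LEN: usize = 24;

/// Build page 0: magic, page_size (LE), num_pages (LE), zero padding to page_size.
pub fn encode_header(page_size: u32, num_pages: u32) -> Vec<u8> {
    assert!(page_size as usize >= HEADER_LEN, "page must hold the header");
    let mut page = vec![0u8; page_size as usize];
    page[0..16].copy_from_slice(MAGIC);
    page[16..20].copy_from_slice(&page_size.to_le_bytes());
    page[20..24].copy_from_slice(&num_pages.to_le_bytes());
    page
}

/// Lower-case hex dump, convenient for comparing against the spot-check string.
pub fn hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}
```

For page_size = 256 the first 20 bytes encode to 4453452d50414745522d76310000000000010000, the same string the cross-test spot-check compares against.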

Design decisions

Why a 16-byte magic instead of 8

8 bytes (e.g. DSEPAGER) would have fit one register and saved 8 bytes per file. 16 bytes lets us include a version suffix and a human-readable prefix that shows up in strings(1). The cost is trivial; the debugging payoff (strings db.bin | grep DSE) is immediate.

Why fixed page size at open() rather than per-page

A real engine fixes page size when the database file is created and refuses to mount it under a different page size. We bake this in by writing the page size into the header and re-reading it on open. The cost: changing page size means rewriting the file. The gain: no per-page metadata, no alignment surprises, page offsets are just pid * page_size.

Why 1-based page ids

Page 0 is the header. Letting allocate() return 0 would force every caller to remember the "0 is reserved" rule and to check it on every dereference. By starting allocation at 1, the contract is enforced by construction: any pid you legitimately hold is safe to read.

Why LRU (and not CLOCK, 2Q, ARC, LFU, …)

LRU is the textbook policy and the easiest one to verify as deterministic across three languages. Its weakness (a sequential scan flushes a hot working set) is real but invisible at the cache sizes our tests use (for example, capacity 8 over 64 pages in Scenario B). db-22 will revisit and measure; until then, simplicity dominates.

Why a doubly-linked list, not a BTreeMap<LastUsed, PageId>

A balanced map gives O(log n) operations and self-orders by recency. A doubly-linked list plus a hashmap gives O(1) operations and the same eviction order, at the cost of one extra pointer per frame. For a cache of 1000 frames that is roughly log2(1000) ≈ 10 comparisons per touch versus a constant number of pointer updates. Worth the boilerplate; SQLite's page cache, RocksDB's LRUCache, and InnoDB's buffer pool all use the list-plus-map shape.

Why write-back, not write-through

Write-through (every write() synchronously persists) is simpler but makes random updates ~100x slower because every dirty page costs a seek and an fsync. Write-back lets us batch many writes to the same page (db-10's B-tree insert may rewrite the same node several times during a single workload) and amortize one disk write per page per flush. The tax is the dirty-page accounting, which is enforced by invariant 3 above.

Why fsync only on flush()

The pager's user owns the durability story. SQLite calls flush at every COMMIT; an LSM (db-05) calls it after every WAL append; an embedded counter store might call it once a minute. Pushing the decision up keeps the pager honest: it never claims durability it cannot deliver. The cross-test scenarios all call flush() exactly once at the end, which is why their hashes are stable.

Why write-without-read on a cache miss

If write(pid, bytes) evicts a clean page and admits (pid, bytes, dirty=true) without first reading the old contents, the disk's bytes are overwritten entirely on the next eviction or flush. This is safe because write requires bytes.len() == page_size — the whole page is supplied. Reading the old contents first would be a 4 KiB I/O for data we throw away immediately. A proper engine extends this with "page allocation hints" so that the OS can skip the readahead too; we don't bother.

Why the workload uses SplitMix64 (the same PRNG as db-10)

Three reasons:

  1. Identical implementation across languages. Three lines of wrapping-add and xor-mul; if any language gets it wrong the sha256 changes on the very first scenario.
  2. No external dependency. Crypto-quality PRNGs would need different libraries per language; SplitMix64 is purely arithmetic.
  3. Consistency across the track. Reusing the same PRNG as db-10 means a future cross-lab test can compare hashes from "B-tree built in RAM" against "B-tree built on the pager" using the same key sequence.

The PRNG draws exactly one u64 per iteration and uses specific bit-slices for op/byte/pid. A variable number of draws per iteration would make scenarios diverge in their key streams, which would defeat reason 3.

Why the scenarios are sequential / random / mixed

  • sequential stresses the readahead-friendly path: page ids walk in monotonic order, cache hits dominate, evictions are predictable.
  • random stresses the eviction path: for a uniform draw the expected hit ratio is roughly cache_capacity / num_pages, evictions happen on most writes, dirty pages move through the cache constantly.
  • mixed is what real workloads look like: a hot subset (selected by (r>>60)&1) plus a long tail of cold pages.

These three together exercise the entire cache state machine. If any of them diverges across languages, the bug is localized (sequential bugs are accounting; random bugs are eviction; mixed bugs are interaction).

Tradeoffs worth flagging

  • No free-list, ever. allocate() only grows the file. Once a B+-tree splits a page and later coalesces it, the now-unused page id is leaked. db-21 (storage engine advanced) will reclaim via a free-list page; here it would just be unverifiable code.
  • Vec<u8> per frame. Every cached page is its own allocation. A real engine packs frames into a single arena (the buffer pool) and indexes by offset. db-22 will measure the difference and likely arena-allocate.
  • No checksums. A corrupted page returns its corrupted bytes silently. db-15 will add a CRC32 to the page footer when SQLite semantics demand it.
  • No mmap. mmap-backed pagers (LMDB) are dramatically simpler but inherit the OS's page-replacement decisions, which we want to control here for testability. db-21 may explore the mmap variant.
  • Single-threaded. No latching, no per-page reader/writer locks. db-13 (MVCC) and db-17 (Raft) will introduce concurrency on top of this layer.

db-11 — Execution

What was built, in the order it was built.

1. Rust (src/rust)

  • Cargo.toml declares lib crate pager11 and binary pagerctl. Edition 2021; release profile lto = "thin", codegen-units = 1.
  • src/lib.rs contains:
    • Constants MAGIC: &[u8;16] = b"DSE-PAGER-v1\0\0\0\0", HEADER_LEN = 24.
    • Frame { pid, data: Vec<u8>, dirty, prev: Option<usize>, next: Option<usize> } plus Pager { file, page_size, num_pages, capacity, frames: Vec<Frame>, free: Vec<usize>, map: HashMap<u32,usize>, head/tail: Option<usize>, hits, misses }.
    • Pager::open(path, page_size, capacity), ::allocate(), ::read(pid), ::write(pid, bytes), ::flush(), ::cache_hits(), ::cache_misses(), ::num_pages().
    • LRU helpers promote(frame_idx), evict_tail(), admit(...) operating on the indexed doubly-linked list.
    • Hand-rolled SHA-256 (FIPS 180-4) so the lib has no dependencies. sha256_hex(bytes) and sha256_file(path).
    • SplitMix64 PRNG and run_workload(path, page_size, capacity, pages, ops, seed, scenario) -> Pager.
    • 10 inline #[cfg(test)] tests: header round-trip, allocate monotonic, read-after-write within and across eviction, dirty survives eviction, flush is idempotent, hits/misses counts, scenario determinism (sequential), scenario determinism (random), scenario determinism (mixed), SHA-256 empty-string test vector.
  • src/bin/pagerctl.rs: order-independent arg parser (args.windows(2) lookup). Subcommands init <path> [--page-size N] and workload <path> --seed S --ops N --pages P --cache C --scenario {sequential|random|mixed} [--page-size N]. Workload prints sha256_file(path) to stdout with no trailing newline.

2. Go (src/go)

  • go.mod module github.com/10xdev/dse/db11, Go 1.22.
  • pager.go ports the Rust API one-for-one. Uses container/list for the LRU chain and map[uint32]*list.Element for lookup. SHA-256 via crypto/sha256 (stdlib is fine; the cross-language comparison is on the file bytes, not the digest algorithm).
  • pager_test.go mirrors the 10 Rust tests plus an 11th, TestWorkloadMatchesCanonicalHashes, that bakes in the three canonical hashes (A/B/C) and runs all three scenarios in a loop. This is the test that catches "Go silently disagrees with Rust" regressions before the cross_test script even runs.
  • cmd/pagerctl/main.go is the matching CLI. Custom flag parser (findFlag, firstPositional, mustU64, mustInt) because flag.Parse stops at the first non-flag argument and the shared script passes <path> before the flags.

3. C++ (src/cpp)

  • CMakeLists.txt builds:
    • pager11 (static lib from src/pager.cc + src/sha256.cc).
    • pagerctl (executable linking pager11).
    • test_pager11 (ctest target linking pager11).
    • Flags: -Wall -Wextra -Wpedantic -Werror -O3 -DNDEBUG in Release.
  • src/pager.h, src/pager.cc: factory function Pager::open(...) returning a std::unique_ptr<Pager>. std::list<Frame> for the LRU chain; std::unordered_map<uint32_t, std::list<Frame>::iterator> for O(1) lookup. std::list::splice for promotion.
  • src/sha256.h, src/sha256.cc: FIPS 180-4 SHA-256 in ~120 lines.
  • src/pagerctl.cc: matching CLI. Includes <unistd.h> for getpid() (used by tests for unique tmp paths); the omission of that header was the only build error during initial bring-up.
  • tests/test_pager11.cc mirrors the Rust tests and adds a canonical-hashes assertion; uses #undef NDEBUG before <cassert> so asserts fire under Release. Prints OK 11 tests on success. Wired into ctest as a single test case.

4. Scripts

  • scripts/verify.sh:
    1. Rust: cargo test --quiet.
    2. Go: go test ./....
    3. C++: cmake -S … -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j && ctest --test-dir build --output-on-failure.
    4. Exits 0 only if all three are green; prints === OK ===.
  • scripts/cross_test.sh:
    1. Builds Rust/Go/C++ pagerctl binaries (cargo release, go build, cmake+make).
    2. Scenario A sequential, seed=42, pages=32, cache=8, ops=200, page_size=256: pagerctl workload <tmp> … per language; sha256 + size comparison against baked-in expected hash.
    3. Scenario B random, seed=7, pages=64, cache=8, ops=500, page_size=256: same shape, different hash.
    4. Scenario C mixed, seed=2024, pages=128, cache=16, ops=1000, page_size=512: same shape, different hash.
    5. Spot-check: read the first 20 bytes of Scenario A's file and assert they equal 4453452d50414745522d76310000000000010000 (magic DSE-PAGER-v1\0\0\0\0 + 0x00000100 for page_size = 256 LE).
    6. Print === ALL OK ===.

What was deliberately not built

  • Free-list / page reclamation. allocate() only grows the file. db-21 (storage engine advanced) introduces a free-list page.
  • Page checksums. No CRC32 footer. db-15 will add one when SQLite-compatibility demands it.
  • mmap backend. All I/O goes through pread/pwrite. An mmap-based variant is a possible db-21 follow-up.
  • Concurrency. No latches; the pager assumes a single thread. db-13 (MVCC) and db-17 (Raft) introduce concurrent access at higher layers.
  • WAL. db-11's pager has no WAL; durability is via in-place write + fsync at flush(). db-03 already covered WAL and db-13 will add a transactional WAL on top of the pager.
  • Compression / encryption. Out of scope; the page bytes are whatever the caller wrote.

db-11 — Observation

What the cross-language verification actually proves, and what the file looks like by hand.

Output of scripts/cross_test.sh

=== compare Scenario A (sequential seed=42 pages=32 cache=8 ops=200 ps=256) ===
  A          rust=cbac0289ce1eb784e5bd80ab1298c3f9677f1aeb3cfdb09ce78d6796c43b9428 (    8448 B)
  A          go  =cbac0289ce1eb784e5bd80ab1298c3f9677f1aeb3cfdb09ce78d6796c43b9428 (    8448 B)
  A          cpp =cbac0289ce1eb784e5bd80ab1298c3f9677f1aeb3cfdb09ce78d6796c43b9428 (    8448 B)
  match(A): cbac0289ce1eb784e5bd80ab1298c3f9677f1aeb3cfdb09ce78d6796c43b9428
=== compare Scenario B (random seed=7 pages=64 cache=8 ops=500 ps=256) ===
  B          rust=3405654fd750bffa933c2d1b590160fcbf8ec446f261cc25c5c04c8c0c3dd023 (   16640 B)
  B          go  =3405654fd750bffa933c2d1b590160fcbf8ec446f261cc25c5c04c8c0c3dd023 (   16640 B)
  B          cpp =3405654fd750bffa933c2d1b590160fcbf8ec446f261cc25c5c04c8c0c3dd023 (   16640 B)
  match(B): 3405654fd750bffa933c2d1b590160fcbf8ec446f261cc25c5c04c8c0c3dd023
=== compare Scenario C (mixed seed=2024 pages=128 cache=16 ops=1000 ps=512) ===
  C          rust=5b10acb3e9cf57e3b314c17dc9fa122d79caac6a46501c71875374f9d6720460 (   66048 B)
  C          go  =5b10acb3e9cf57e3b314c17dc9fa122d79caac6a46501c71875374f9d6720460 (   66048 B)
  C          cpp =5b10acb3e9cf57e3b314c17dc9fa122d79caac6a46501c71875374f9d6720460 (   66048 B)
  match(C): 5b10acb3e9cf57e3b314c17dc9fa122d79caac6a46501c71875374f9d6720460
=== spot-check header ===
  spot-checks ok
=== ALL OK ===

File sizes are exactly (pages + 1) * page_size:

  • A: (32 + 1) * 256 = 8448
  • B: (64 + 1) * 256 = 16640
  • C: (128 + 1) * 512 = 66048

The +1 is page 0 (header).

Reading the header by hand

For Scenario A (page_size = 256, num_pages = 33 including the header):

xxd -l 24 /tmp/pager-A.rust.bin
00000000: 4453 452d 5041 4745 522d 7631 0000 0000  DSE-PAGER-v1....
00000010: 0001 0000 2100 0000                      ....!...

Decoded:

bytes                                              meaning
44 53 45 2d 50 41 47 45 52 2d 76 31 00 00 00 00    magic DSE-PAGER-v1\0\0\0\0
00 01 00 00                                        page_size = 0x00000100 = 256 (LE)
21 00 00 00                                        num_pages = 0x00000021 = 33 (LE)

Bytes 24..255 are zero (header padding to page_size).
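The same decoding can be done in code. This is a small sketch with a hypothetical parse_header function and Header struct; it validates the magic and reads the two little-endian fields from the first 24 bytes.

```rust
const MAGIC: &[u8; 16] = b"DSE-PAGER-v1\0\0\0\0";

#[derive(Debug, PartialEq)]
pub struct Header {
    pub page_size: u32,
    pub num_pages: u32,
}

/// Parse the first 24 bytes of page 0. Errors are plain strings to keep the sketch short.
pub fn parse_header(buf: &[u8]) -> Result<Header, String> {
    if buf.len() < 24 {
        return Err(format!("header needs 24 bytes, got {}", buf.len()));
    }
    if buf[..16] != MAGIC[..] {
        return Err("bad magic".to_string());
    }
    // Both fields are little-endian u32.
    let page_size = u32::from_le_bytes([buf[16], buf[17], buf[18], buf[19]]);
    let num_pages = u32::from_le_bytes([buf[20], buf[21], buf[22], buf[23]]);
    Ok(Header { page_size, num_pages })
}
```

Feeding it the 24 bytes from the xxd dump above yields page_size = 256 and num_pages = 33.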

The cross-test's spot-check confirms that bytes 0..19 exactly equal 4453452d50414745522d76310000000000010000. A change to the magic or to the page_size encoding surfaces here even if all three implementations make it consistently. A change made in only one implementation also breaks the cross-language sha256 match, and any format change invalidates the canonical hashes table, so one bug typically trips several checks at once.

Reading a data page by hand

For Scenario A the workload writes a known byte value B = (r >> 24) & 0xFF to every byte of the chosen page. So any non-zero data page in /tmp/pager-A.rust.bin should be 256 identical bytes:

xxd -s 256 -l 256 /tmp/pager-A.rust.bin | head -2
00000100: 8c8c 8c8c 8c8c 8c8c 8c8c 8c8c 8c8c 8c8c  ................
00000110: 8c8c 8c8c 8c8c 8c8c 8c8c 8c8c 8c8c 8c8c  ................

A run of one byte value repeated 256 times. Different pages contain different fill bytes; the sha256 of the file rolls all of them up. This makes hand-debugging a divergence between languages straightforward: dump both files, diff -u <(xxd a) <(xxd b), and the first non-matching page tells you exactly which (pid, byte) the languages disagreed on.
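The manual diff can also be automated. A minimal sketch, assuming both files have already been read into memory; first_divergence is a hypothetical helper, not part of the lab's scripts.

```rust
/// Locate the first byte at which two pager files differ.
/// Returns (pid, offset_within_page), or None if the files are identical.
pub fn first_divergence(a: &[u8], b: &[u8], page_size: usize) -> Option<(usize, usize)> {
    assert!(page_size > 0);
    // Compare the common prefix byte by byte.
    if let Some(i) = a.iter().zip(b.iter()).position(|(x, y)| x != y) {
        return Some((i / page_size, i % page_size));
    }
    // Same prefix but different lengths: the divergence starts where the shorter file ends.
    if a.len() != b.len() {
        let i = a.len().min(b.len());
        return Some((i / page_size, i % page_size));
    }
    None
}
```

A result with pid 0 points at the header; any other pid names the data page to dump with xxd -s.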

Cache statistics (informal)

Running Scenario B with cache = 8 over pages = 64:

hits   = ~190
misses = ~310

Hit ratio ~38%, about three times the uniform-random baseline of cache_capacity / num_pages (8 / 64 = 12.5%); the gap is attributed to temporal locality in the op stream. The Rust unit tests assert hits + misses == ops but not the exact ratio, because writes that bypass reads (write-on-miss admission) keep the absolute counts implementation-defined enough that an exact check would be fragile. The file bytes, however, are not implementation-defined, and that is what we pin.

db-11 — Verification

Prerequisites

  • macOS or Linux with Apple Clang / clang ≥ 14 / gcc ≥ 11.
  • cmake ≥ 3.20.
  • Rust toolchain ≥ 1.74.
  • Go ≥ 1.22.
  • shasum, xxd, awk (default on macOS; coreutils on Linux).

One command

cd db-11-pager-system
scripts/verify.sh        # unit tests, all three languages
scripts/cross_test.sh    # cross-language sha256 match

They should print === OK === and === ALL OK === respectively, and both should exit 0.

Per-language drill-down

Rust

cd db-11-pager-system/src/rust
cargo test --quiet
cargo build --release

Expected: all 10 inline tests pass; target/release/pagerctl is built.

Go

cd db-11-pager-system/src/go
go test ./...
go build ./cmd/pagerctl

Expected: ok github.com/10xdev/dse/db11 <duration>. The TestWorkloadMatchesCanonicalHashes test is the most important; it fails loudly if Go disagrees with Rust on any of the three scenarios.

C++

cd db-11-pager-system/src/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
ctest --test-dir build --output-on-failure

Expected: 100% tests passed, 0 tests failed out of 1 and the test_pager11 target prints OK 11 tests.

What "green" means

A green run guarantees:

  • All inline unit tests pass in Rust, Go, and C++.

  • The cross-language test produces byte-identical files for all three canonical scenarios:

    scenario  type        seed  pages  cache  ops   psz  sha256
    A         sequential  42    32     8      200   256  cbac0289ce1eb784e5bd80ab1298c3f9677f1aeb3cfdb09ce78d6796c43b9428
    B         random      7     64     8      500   256  3405654fd750bffa933c2d1b590160fcbf8ec446f261cc25c5c04c8c0c3dd023
    C         mixed       2024  128    16     1000  512  5b10acb3e9cf57e3b314c17dc9fa122d79caac6a46501c71875374f9d6720460

    Matching sha256 across three independent implementations proves agreement on:

    • the file format (header magic, page_size, num_pages encoding),
    • the SplitMix64 PRNG (constants and bit-extraction layout),
    • the workload state machine (op/pid/byte selection),
    • the cache admission rule (write-on-miss admits without read),
    • the eviction rule (LRU tail, dirty pages written back),
    • the flush order (dirty pages sorted by pid before write),
    • and the final on-disk page layout.
  • The spot-check confirms the first 20 bytes of Scenario A's file are 4453452d50414745522d76310000000000010000 (magic + page_size = 256 LE), guarding against the regression where all three languages "successfully" agree on a wrong header.

When verification fails

  • Cross-language sha256 mismatch already on Scenario A — the sequential scenario exercises the simplest code path (no random pid selection, predictable evictions). A failure here is almost always either:
    • the magic / header encoding (check the spot-check first), or
    • the SplitMix64 PRNG (re-derive the first 5 outputs by hand and compare against 0xe220a8397b1dcdaf, …).
  • Scenario A matches, B fails — the random scenario stresses eviction. Look for off-by-one in LRU tail selection or for a language whose unordered_map iteration leaks into the flush order (it should not; flush sorts by pid).
  • A and B match, C fails — the mixed scenario uses a larger cache and a larger page size; suspect a page-size assumption baked into the implementation (e.g., a hard-coded 256 instead of reading from the header).
  • All three sha256s match each other but disagree with the table above — a legitimate algorithm change. Make sure it was intentional, then update cross_test.sh, the Go canonical-hashes test, the C++ canonical-hashes assertion, and the table above in the same commit.
  • One language's unit tests pass but cross_test fails — almost always a CLI bug, not a library bug. The unit tests drive the library directly; the cross_test drives the binary through the shell. Double-check argument parsing: in particular, that <path> may appear before the --flags (this is the bug the Go port hit during bring-up, fixed by the custom findFlag/firstPositional parser).
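An order-independent lookup of the kind described for the CLIs can be sketched as follows. The function names mirror the Go helpers mentioned above but the bodies are illustrative, and the sketch assumes every --flag is followed by exactly one value.

```rust
/// Return the value following `name`, wherever the pair appears in `args`.
pub fn find_flag<'a>(args: &'a [String], name: &str) -> Option<&'a str> {
    args.windows(2)
        .find(|w| w[0] == name)
        .map(|w| w[1].as_str())
}

/// Return the first argument that is neither a flag nor a flag's value.
/// Assumption: every `--flag` takes exactly one value.
pub fn first_positional(args: &[String]) -> Option<&str> {
    let mut i = 0;
    while i < args.len() {
        if args[i].starts_with("--") {
            i += 2; // skip the flag and its value
        } else {
            return Some(args[i].as_str());
        }
    }
    None
}
```

Both orderings, path first or flags first, resolve to the same path and the same flag values, which is the property the shared script relies on.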

db-11 — Broader Ideas

Where this disk-backed pager fits in the rest of the track, and which real-world techniques live one or two steps beyond it.

Immediate next labs

  • db-12 — SQL frontend. The first consumer of the pager. A row in a table becomes some bytes inside some page; an INSERT is a Pager::write. The B+-tree layer that maps rows-to-pages is built in db-15 but its scaffolding starts here.
  • db-13 — Transactions and MVCC. Each transaction sees a consistent snapshot of the pager. The simplest implementation is copy-on-write at the page level: a write conceptually allocates a new page rather than mutating the old, and snapshots hold roots pointing at the version they read. Our pager's monotonic allocate() is the right primitive for this.
  • db-14 — Indexes. Secondary indexes are additional B+-trees living in the same pager file as the primary. Multiple trees, one pager, one buffer pool.
  • db-15 — SQLite-complete. Stitches db-10..db-14 together. Will add page checksums, the rollback journal or WAL, and the free-list page so that deleted pages don't leak.
  • db-21 — Storage engine advanced. Revisits this pager with CLOCK / 2Q eviction, a freelist, an mmap variant, and possibly a group-commit fsync scheduler.
  • db-22 — Performance and benchmarking. Measures hit ratio, eviction rate, and fsync cost under realistic workloads; compares LRU against alternative policies.

How this lab's pieces show up in real systems

  • The 4 KiB page is a common default (SQLite since 3.12, RocksDB's block_size); Postgres defaults to 8 KiB and InnoDB to 16 KiB, both multiples of 4 KiB. 4 KiB matches the typical filesystem block size and the Linux page-cache granule on x86-64, so a page-aligned read or write never straddles two OS pages.
  • The header-on-first-page trick is near-universal: SQLite, BoltDB, InnoDB, and Berkeley DB all reserve the first page of the file for metadata (SQLite numbers it page 1; BoltDB keeps two meta pages).
  • Write-back over a bounded buffer pool is the classic design; the pool is sized by shared_buffers in Postgres, innodb_buffer_pool_size in InnoDB, and PRAGMA cache_size in SQLite. Our implementation is the textbook version of that idea.
  • fsync-only-on-flush is the contract every transactional engine demands of its pager: the WAL or rollback journal layer above decides when, the pager just provides the primitive. It is closely related to what the DBMS literature calls the "no-force" policy, under which dirty pages need not be forced to disk at commit.
  • The doubly-linked-list + hashmap LRU is a common pattern in production caches, for example InnoDB's buf_LRU and RocksDB's LRUCache. Not every engine uses it: Postgres uses a clock sweep, and CPU caches approximate LRU in hardware. The textbook is real.

Variants worth implementing later

  • CLOCK replacement — a single circular array with a reference bit per frame. Approximates LRU at lower overhead because there's no list splice per access. PostgreSQL uses this.
  • 2Q — two LRU lists, one for "seen once" and one for "seen twice or more". Resists scan-induced cache pollution. Cheap to implement on top of the existing LRU code.
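  • ARC (Adaptive Replacement Cache) — IBM's self-tuning policy that, like 2Q, keeps separate recency and frequency lists but adapts the split between them at run time. IBM patented it, which is why PostgreSQL dropped it; check the patent status before shipping an implementation.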
  • Copy-on-write pages (LMDB-style) — every write allocates a fresh page; old versions stay live for concurrent readers. Trades higher write amplification for free MVCC.
  • mmap-backed pager — mmap the whole file, let the OS manage the page cache. Drastically simpler code; loses control over eviction and durability.
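As a taste of the first variant, here is a minimal CLOCK sketch. It tracks pids only (no page data or dirty bits), finds hits with a linear scan where a real pool would use a hashmap, and sets the reference bit on admission, which is one of several conventions. The Clock type and access method are hypothetical names.

```rust
/// Hypothetical CLOCK replacement over a fixed ring of slots.
pub struct Clock {
    pids: Vec<Option<u32>>, // None = empty slot
    referenced: Vec<bool>,  // one reference bit per slot
    hand: usize,            // next slot the sweep will inspect
}

impl Clock {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0);
        Clock { pids: vec![None; capacity], referenced: vec![false; capacity], hand: 0 }
    }

    /// Record an access to `pid`. Returns the pid evicted to make room, if any.
    pub fn access(&mut self, pid: u32) -> Option<u32> {
        // Hit: set the reference bit; nothing moves (this is the saving over LRU).
        if let Some(i) = self.pids.iter().position(|p| *p == Some(pid)) {
            self.referenced[i] = true;
            return None;
        }
        // Miss: sweep until a slot is empty or has its reference bit clear.
        loop {
            let i = self.hand;
            self.hand = (self.hand + 1) % self.pids.len();
            if self.pids[i].is_none() || !self.referenced[i] {
                let evicted = self.pids[i];
                self.pids[i] = Some(pid);
                self.referenced[i] = true;
                return evicted;
            }
            // Second chance: clear the bit and move on.
            self.referenced[i] = false;
        }
    }
}
```

With capacity 3, accessing 1, 2, 3, 4 evicts 1; touching 2 and then admitting 5 evicts 3, the same victims LRU would pick on this short trace.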

Performance experiments worth running later

  • Plot hit ratio vs cache_capacity / num_pages for each scenario. Expect a knee around 25..50% for the mixed scenario.
  • Measure the cost of flush() as a function of dirty-page count. On spinning disk, sorted writes should be markedly cheaper than unsorted ones, and the gap should widen as the dirty set grows.
  • Compare write-back vs write-through latency for a steady stream of small updates. The write-back win should be order-of-magnitude on any device with non-trivial fsync cost.
  • Vary page_size from 256 B to 64 KiB at a fixed cache memory budget. The hit ratio should improve with smaller pages (finer caching granule) but per-operation bookkeeping cost grows.

What "production-quality" would require beyond this lab

  • Crash recovery. Right now, a crash in the middle of a flush leaves a half-written page on disk and no way to detect it. SQLite uses a rollback journal; Postgres uses WAL + a checkpointer. db-13 will introduce the simplest form of this.
  • Checksums. A CRC32 footer per page so torn writes are detectable, not silently returned to the caller.
  • A free-list page so deleted pages can be reused, otherwise files grow monotonically.
  • Concurrent access. Reader-writer latching at the page level so the pager scales to multiple threads.
  • Direct I/O / O_DIRECT to bypass the OS page cache and prevent double-buffering. Needed at high throughput; subtle to get right.
  • Async I/O. io_uring on Linux, IOCP on Windows. The synchronous pread/pwrite we use is fine for teaching and for any workload where the database is not the bottleneck.

db-11 step 01 — Page I/O and file layout

Goal

Build the bottom half of the pager: the file format and the uncached read / write / allocate path. No cache, no LRU, no eviction. Every read is a pread; every write is a pwrite; flush is just fsync.

Tasks

  1. Define MAGIC = b"DSE-PAGER-v1\0\0\0\0" (16 bytes) and HEADER_LEN = 24.
  2. Implement Pager::open(path, page_size, capacity):
    • If file does not exist or is empty, create it; write a fresh header page (magic + page_size + num_pages=1, zero-padded to page_size); fsync.
    • If file exists, read bytes 0..24, validate magic, parse page_size and num_pages. The caller-supplied page_size argument must match the on-disk value (or be supplied as the authoritative size on creation).
  3. Implement Pager::allocate() -> u32:
    • return num_pages, then num_pages += 1. The on-disk file is not yet extended: the next flush() records the new num_pages on page 0, and the page itself materialises when it is first written.
  4. Implement Pager::read(pid) -> Vec<u8> (no caching yet):
    • validate 1 <= pid < num_pages.
    • pread(page_size bytes at offset pid * page_size).
  5. Implement Pager::write(pid, bytes) (no caching yet):
    • validate bytes.len() == page_size.
    • validate 1 <= pid < num_pages.
    • pwrite(bytes at offset pid * page_size).
  6. Implement Pager::flush():
    • rewrite page 0 with current num_pages (handles allocate-only transactions).
    • fsync.
  7. Implement Pager::close():
    • flush() then drop the file handle.
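The tasks above fit in one short Rust file. This is a sketch under stated assumptions rather than the lab's reference solution: it targets Unix (positional I/O via std::os::unix::fs::FileExt), omits the capacity argument because step 01 has no cache, and leaves close() to Drop.

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::os::unix::fs::FileExt; // read_exact_at / write_all_at (Unix only)
use std::path::Path;

const MAGIC: &[u8; 16] = b"DSE-PAGER-v1\0\0\0\0";

pub struct Pager {
    file: File,
    page_size: u32,
    num_pages: u32, // counts page 0
}

impl Pager {
    /// Create the file with a fresh header, or validate an existing one.
    pub fn open(path: &Path, page_size: u32) -> io::Result<Pager> {
        assert!(page_size >= 24, "page must hold the 24-byte header");
        let file = OpenOptions::new().read(true).write(true).create(true).open(path)?;
        if file.metadata()?.len() == 0 {
            let mut pager = Pager { file, page_size, num_pages: 1 };
            pager.flush()?; // writes page 0 and fsyncs
            return Ok(pager);
        }
        let mut hdr = [0u8; 24];
        file.read_exact_at(&mut hdr, 0)?;
        let on_disk = u32::from_le_bytes([hdr[16], hdr[17], hdr[18], hdr[19]]);
        if hdr[..16] != MAGIC[..] || on_disk != page_size {
            return Err(io::Error::new(io::ErrorKind::InvalidData, "bad magic or page_size"));
        }
        let num_pages = u32::from_le_bytes([hdr[20], hdr[21], hdr[22], hdr[23]]);
        Ok(Pager { file, page_size, num_pages })
    }

    pub fn num_pages(&self) -> u32 {
        self.num_pages
    }

    /// Hand out the next pid; the file is not extended here.
    pub fn allocate(&mut self) -> u32 {
        let pid = self.num_pages;
        self.num_pages += 1;
        pid
    }

    fn check(&self, pid: u32) {
        assert!(pid >= 1 && pid < self.num_pages, "pid out of range");
    }

    /// Uncached read: one pread per call.
    pub fn read(&self, pid: u32) -> io::Result<Vec<u8>> {
        self.check(pid);
        let mut buf = vec![0u8; self.page_size as usize];
        self.file.read_exact_at(&mut buf, pid as u64 * self.page_size as u64)?;
        Ok(buf)
    }

    /// Uncached write: one pwrite per call; the caller supplies the whole page.
    pub fn write(&mut self, pid: u32, bytes: &[u8]) -> io::Result<()> {
        assert_eq!(bytes.len(), self.page_size as usize);
        self.check(pid);
        self.file.write_all_at(bytes, pid as u64 * self.page_size as u64)
    }

    /// Rewrite page 0 with the current num_pages, then fsync.
    pub fn flush(&mut self) -> io::Result<()> {
        let mut page0 = vec![0u8; self.page_size as usize];
        page0[..16].copy_from_slice(MAGIC);
        page0[16..20].copy_from_slice(&self.page_size.to_le_bytes());
        page0[20..24].copy_from_slice(&self.num_pages.to_le_bytes());
        self.file.write_all_at(&page0, 0)?;
        self.file.sync_all()
    }
}
```

In this sketch, reading a page that was allocated but never written returns an error, because the file has not grown to cover it yet.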

Acceptance

Inline unit tests:

  • header_round_trip — open new file, close, reopen, assert num_pages == 1 and the magic is intact.
  • allocate_monotonic — three allocate() calls in a row return 1, 2, 3.
  • write_then_read_same_pager — allocate, write a known byte pattern, read it back, assert equal.
  • write_then_reopen_then_read — allocate, write, flush(), drop, reopen, read; bytes survived.
  • flush_extends_file — after allocate + write + flush, file size equals (num_pages) * page_size.

All five tests green in Rust, Go, and C++.

Discussion prompts

  • Why is num_pages stored on page 0 rather than inferred from the file size? (Hint: what happens between allocate() and flush() if the OS crashes?)
  • What goes wrong if open() is called concurrently from two processes on the same file?
  • Why does flush() rewrite page 0 even if no data page changed?

db-11 step 02 — LRU cache with write-back

Goal

Add the in-memory page cache on top of step 01. Bounded capacity, LRU eviction, write-back on eviction, dirty bit per frame. After this step the disk is touched only on cache misses, evictions, and flush().

Tasks

  1. Define Frame { pid: u32, data: Vec<u8>, dirty: bool, prev, next } where prev/next are indices into a Vec<Frame> (Rust) or *list.Element (Go) or std::list<Frame>::iterator (C++).
  2. Add to Pager:
    • capacity: usize — set at open().
    • frames: Vec<Frame> — the storage backing the LRU chain.
    • free: Vec<usize> — reusable indices after eviction.
    • map: HashMap<u32, usize> — pid → frame index.
    • head, tail: Option<usize> — MRU and LRU ends.
    • hits, misses: u64 — accounting.
  3. Helpers:
    • promote(idx) — unlink frame from current position, insert at head. No-op if already at head.
    • unlink(idx) — remove frame from the list.
    • evict_tail() — pop the LRU frame; if dirty, pwrite before reusing the slot.
    • admit(pid, data, dirty) — if at capacity, evict_tail first; allocate a frame (from free or push new); insert at head; update map.
  4. Rewrite read(pid):
    • if map[pid] exists: promote, hits += 1, clone, return.
    • else: misses += 1, pread, admit(pid, data, dirty=false), clone, return.
  5. Rewrite write(pid, bytes):
    • if map[pid] exists: overwrite data, set dirty = true, promote.
    • else: admit(pid, bytes, dirty=true), with no pread.
  6. Rewrite flush():
    • collect all (pid, frame_idx) where dirty == true.
    • sort by pid ascending.
    • for each, pwrite at pid * page_size, set dirty = false.
    • rewrite page 0 with current num_pages.
    • fsync.
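The helpers in task 3 can be sketched with an index-based doubly linked list. This is a minimal illustration, not the lab's reference code: the `Cache` type, the in-memory `disk` map that stands in for pread/pwrite, and the `writebacks` counter are all assumptions made so the block runs on its own. The read path, flush(), and the hit/miss counters are left out.

```rust
use std::collections::HashMap;

/// One cached page. `prev`/`next` are indices into `Cache::frames`.
struct Frame {
    pid: u32,
    data: Vec<u8>,
    dirty: bool,
    prev: Option<usize>,
    next: Option<usize>,
}

/// Hypothetical stand-in for the step-02 Pager; `disk` replaces the real file.
struct Cache {
    capacity: usize,
    frames: Vec<Frame>,
    free: Vec<usize>,
    map: HashMap<u32, usize>,
    head: Option<usize>, // MRU end
    tail: Option<usize>, // LRU end
    disk: HashMap<u32, Vec<u8>>,
    writebacks: u64,
}

impl Cache {
    fn new(capacity: usize) -> Self {
        Cache { capacity, frames: Vec::new(), free: Vec::new(), map: HashMap::new(),
                head: None, tail: None, disk: HashMap::new(), writebacks: 0 }
    }

    /// Detach a linked frame from the chain, fixing up head/tail as needed.
    fn unlink(&mut self, idx: usize) {
        let (prev, next) = (self.frames[idx].prev, self.frames[idx].next);
        match prev { Some(p) => self.frames[p].next = next, None => self.head = next }
        match next { Some(n) => self.frames[n].prev = prev, None => self.tail = prev }
        self.frames[idx].prev = None;
        self.frames[idx].next = None;
    }

    /// Link a detached frame in at the MRU end.
    fn push_head(&mut self, idx: usize) {
        self.frames[idx].prev = None;
        self.frames[idx].next = self.head;
        if let Some(h) = self.head { self.frames[h].prev = Some(idx); }
        self.head = Some(idx);
        if self.tail.is_none() { self.tail = Some(idx); }
    }

    /// Move a frame to the MRU end; no-op if it is already there.
    fn promote(&mut self, idx: usize) {
        if self.head == Some(idx) { return; }
        self.unlink(idx);
        self.push_head(idx);
    }

    /// Drop the LRU frame; a dirty frame is written back before its slot is reused.
    fn evict_tail(&mut self) {
        let Some(idx) = self.tail else { return };
        self.unlink(idx);
        let pid = self.frames[idx].pid;
        if self.frames[idx].dirty {
            let data = std::mem::take(&mut self.frames[idx].data);
            self.disk.insert(pid, data);
            self.writebacks += 1;
        }
        self.map.remove(&pid);
        self.free.push(idx);
    }

    /// Insert a page at the MRU end, evicting first when the cache is full.
    fn admit(&mut self, pid: u32, data: Vec<u8>, dirty: bool) {
        if self.map.len() >= self.capacity { self.evict_tail(); }
        let frame = Frame { pid, data, dirty, prev: None, next: None };
        let idx = match self.free.pop() {
            Some(i) => { self.frames[i] = frame; i }
            None => { self.frames.push(frame); self.frames.len() - 1 }
        };
        self.push_head(idx);
        self.map.insert(pid, idx);
    }

    /// Whole-page write: a miss admits the page without reading it first.
    fn write(&mut self, pid: u32, bytes: Vec<u8>) {
        match self.map.get(&pid).copied() {
            Some(idx) => {
                self.frames[idx].data = bytes;
                self.frames[idx].dirty = true;
                self.promote(idx);
            }
            None => self.admit(pid, bytes, true),
        }
    }
}
```

With capacity 2, writing pages 1, 2, 1, 3 in that order should evict page 2 (page 1 was promoted by its second write) and record exactly one write-back.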

Acceptance

Inline unit tests:

  • cache_hit_does_not_pread — write then read twice; second read produces a cache hit (cache_hits >= 1).
  • eviction_writes_back_dirty — fill cache + 1, evict the oldest frame, drop the pager, reopen, read the evicted pid, bytes match the value written before eviction.
  • eviction_skips_clean_pages — fill cache with only-reads, evict, reopen: file size unchanged (no spurious writes).
  • flush_is_idempotent — flush twice in a row, file bytes identical, both succeed.
  • hits_misses_accounting — for a known sequence of operations, cache_hits + cache_misses equals the number of read calls (writes that hit the cache are not counted as reads).

All three green in Rust, Go, and C++.

Discussion prompts

  • Why does write on a miss not do a pread? When could this be wrong? (Answer: never, as long as the caller writes the whole page. Partial-page writes would need read-modify-write.)
  • Why sort dirty pages by pid before writing them out?
  • What is the worst-case eviction cost, and how could evict_tail amortize fsyncs across many evictions?

db-11 step 03 — Cross-language byte agreement

Goal

Pin the file format. After this step a workload run in Rust, Go, or C++ produces sha256-identical files for the same inputs. This is what makes the pager a real cross-language contract, not three loosely-related implementations.

Tasks

  1. Implement SplitMix64 exactly:

    next(state):
        state += 0x9E3779B97F4A7C15            // wrapping
        z = state
        z = (z ^ (z >> 30)) * 0xBF58476D1CE4E7B5
        z = (z ^ (z >> 27)) * 0x94D049BB133111EB
        return z ^ (z >> 31)
    

    All multiplies are wrapping u64. Test against a known first-output table (seed = 0 yields 0xE220A8397B1DCDAF etc.).

  2. Implement run_workload(path, page_size, capacity, pages, ops, seed, scenario):

    pager = Pager::open(path, page_size, capacity)
    while pager.num_pages() < pages + 1:
        pager.allocate()
    rng = SplitMix64(seed)
    for iteration in 0..ops:
        r = rng.next()
        op       = (r >> 62) & 0b11           // 0,1,2,3
        byte_val = (r >> 24) & 0xFF
        pid = match scenario:
            sequential -> 1 + (iteration % pages)
            random     -> 1 + (r as u64 % pages)
            mixed      -> if (r >> 60) & 1 then random_pid else sequential_pid
        match op:
            0 | 1 -> write a page of [byte_val; page_size]
            2     -> read pid (discard result)
            3     -> skip
    pager.flush()
    return pager
    

    Critical: each iteration consumes exactly one next() call. This is what keeps the three scenarios comparable for a given seed.

  3. Build a pagerctl CLI in each language with subcommands init and workload. workload runs the function above and prints sha256_file(path) in lowercase hex with no trailing newline to stdout. The CLI must accept <path> either before or after the --flags — the cross-test passes path first; some contributors will pass it last.

  4. Write scripts/cross_test.sh:

    • build all three binaries (cargo release, go build, cmake+make).
    • for scenarios A (sequential), B (random), C (mixed): run each language, sha256 the resulting file, assert all three match each other and match the baked-in expected hash.
    • spot-check the first 20 bytes of one file equal the expected header bytes.
  5. Bake the canonical hashes into the Go and C++ test suites too, so a divergence is caught at go test / ctest time even without running the shell script.

Acceptance

  • scripts/verify.sh exits 0; each language reports its tests green.
  • scripts/cross_test.sh exits 0 with === ALL OK ===.
  • The canonical hashes table in docs/verification.md matches the hashes hard-coded in:
    • scripts/cross_test.sh
    • src/go/pager_test.go::TestWorkloadMatchesCanonicalHashes
    • src/cpp/tests/test_pager11.cc (canonical hashes block)

Discussion prompts

  • What happens to the sha256 of Scenario A if you swap the order of the two multiplies in SplitMix64?
  • Why does the workload draw exactly one next() per iteration, even for the skip case? (See docs/analysis.md.)
  • If we wanted to add a fourth scenario (e.g. "read-heavy"), what would have to change in this lab to keep the cross-test working?

db-12 — SQL Frontend

What is it?

A self-contained SQL frontend: a tokenizer + recursive-descent parser that turns a small but realistic SQL dialect into an Abstract Syntax Tree, plus a canonical byte serializer that hashes deterministically. There is no execution engine in this lab: the AST stops at bytes-on-disk and bytes-on-the-wire.

The supported dialect is the smallest one that's still interesting:

  • CREATE TABLE name (col TYPE, …); with INT and TEXT columns.
  • INSERT INTO name VALUES (…), (…), …; with single-row and multi-row form.
  • SELECT * | col, col, … FROM name [WHERE col OP literal];
  • DELETE FROM name [WHERE col OP literal];
  • UPDATE name SET col = lit, col = lit, … [WHERE col OP literal];

with six comparison operators (=, !=, <, <=, >, >=), integer and text literals ('pad''let' style escape), -- line comments, and case-insensitive keywords. Identifiers are preserved verbatim.

Execution — the bytecode VM that walks the AST to actually run the statement — is deliberately deferred to db-13 (where it can share the transaction machinery it really needs). This lab stops at "the program parsed to this exact AST, and we can prove it byte-for-byte across three languages".

Why does it matter?

This is the lab where the project pivots from storage to language. Every SQL database in the world starts with the same three-stage front:

source text ──► tokens ──► AST ──► (planner / VM / executor)

What we are doing here is exactly the first two arrows (tokenize and parse), plus one extra step that no production engine needs: serializing the AST to a canonical byte stream. The project needs that step as the only honest cross-language proof that three independent parsers agree on the meaning of the same SQL text.

If you've ever read SQLite's tokenize.c or Postgres's scan.l / gram.y, this lab is the same shape, written by hand:

  • Tokenizing is a single character-by-character pass with a handful of state branches (whitespace, comment, identifier, number, string, operator, single-char punct).
  • Parsing is recursive descent: one function per non-terminal, peek at the next token, dispatch, recurse. No parser generators, no table-driven state machines, no lookahead arithmetic — the grammar is tiny enough that the code is almost a 1:1 transliteration of the BNF.
  • The AST is a discriminated union (Rust enum, Go field-bag struct, C++ struct with a kind tag). Statements know their kind; literals know their type.

Once you've built one frontend by hand, every other one becomes a reading exercise.

How does it work?

            ┌──────── source text (UTF-8 bytes) ────────┐
            │                                            │
   tokenize │  char loop: ws | -- comment | ident |     │
   ─────────►│  number | string | op | punct             │
            │  → Vec<Token { kind, payload, line, col }>│
            │                                            │
   parse    │  recursive descent:                        │
   ─────────►│    parse_program        = stmt* EOF       │
            │    parse_stmt            dispatches on kw  │
            │    parse_create/insert/select/delete/update│
            │  → Vec<Statement>                          │
            │                                            │
   serialize│  walk AST, emit canonical bytes (see       │
   ─────────►│  "wire format" below). Magic header lets  │
            │  decoders sanity-check.                    │
            │  → Vec<u8>                                 │
            │                                            │
   sha256   │  inline FIPS 180-4 (Rust + C++);           │
   ─────────►│  crypto/sha256 (Go). Output hex matches   │
            │  in all three languages on any input.      │
            └────────────────────────────────────────────┘
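The tokenize stage of the diagram can be sketched as a single byte loop. This is a simplified illustration under stated assumptions, not the lab's `tokenize`: the `Tok` variants, the keyword table, treating `*` as punctuation, and counting columns in bytes are all choices made here. A leading `-` sign on integers and newlines inside string literals are not handled.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    Keyword(String), // upper-cased for matching
    Ident(String),   // preserved verbatim
    Int(i64),
    Text(String),    // unescaped contents
    Op(String),
    Punct(char),
}

#[derive(Debug, Clone, PartialEq)]
struct Token {
    tok: Tok,
    line: u32,
    col: u32,
}

// Assumed keyword table for this sketch.
const KEYWORDS: &[&str] = &[
    "CREATE", "TABLE", "INSERT", "INTO", "VALUES", "SELECT", "FROM",
    "WHERE", "DELETE", "UPDATE", "SET", "INT", "TEXT",
];

fn err(line: u32, col: u32, msg: &str) -> String {
    format!("parse error at line {line} col {col}: {msg}")
}

fn tokenize(src: &str) -> Result<Vec<Token>, String> {
    let b = src.as_bytes();
    let (mut i, mut line, mut col) = (0usize, 1u32, 1u32);
    let mut out = Vec::new();
    while i < b.len() {
        let (start, ch) = (i, b[i]);
        if ch == b'\n' {
            i += 1; line += 1; col = 1;
            continue;
        }
        if ch.is_ascii_whitespace() {
            i += 1; col += 1;
            continue;
        }
        if ch == b'-' && b.get(i + 1) == Some(&b'-') {
            // Line comment: skip to the newline, which resets `col` above.
            while i < b.len() && b[i] != b'\n' { i += 1; }
            continue;
        }
        let tok = if ch.is_ascii_alphabetic() || ch == b'_' {
            while i < b.len() && (b[i].is_ascii_alphanumeric() || b[i] == b'_') { i += 1; }
            let word = &src[start..i];
            let upper = word.to_ascii_uppercase();
            if KEYWORDS.contains(&upper.as_str()) { Tok::Keyword(upper) } else { Tok::Ident(word.to_string()) }
        } else if ch.is_ascii_digit() {
            while i < b.len() && b[i].is_ascii_digit() { i += 1; }
            let n = src[start..i].parse::<i64>().map_err(|_| err(line, col, "integer out of range"))?;
            Tok::Int(n)
        } else if ch == b'\'' {
            i += 1;
            let mut s: Vec<u8> = Vec::new();
            loop {
                match b.get(i) {
                    None => return Err(err(line, col, "unterminated string")),
                    // A doubled quote is one literal quote.
                    Some(&b'\'') if b.get(i + 1) == Some(&b'\'') => { s.push(b'\''); i += 2; }
                    Some(&b'\'') => { i += 1; break; }
                    Some(&x) => { s.push(x); i += 1; }
                }
            }
            Tok::Text(String::from_utf8_lossy(&s).into_owned())
        } else if matches!(ch, b'<' | b'>' | b'!' | b'=') {
            i += 1;
            if ch != b'=' && b.get(i) == Some(&b'=') { i += 1; }
            if &src[start..i] == "!" { return Err(err(line, col, "unexpected character")); }
            Tok::Op(src[start..i].to_string())
        } else if matches!(ch, b'(' | b')' | b',' | b';' | b'*') {
            i += 1;
            Tok::Punct(ch as char)
        } else {
            return Err(err(line, col, "unexpected character"));
        };
        out.push(Token { tok, line, col });
        col += (i - start) as u32;
    }
    Ok(out)
}
```

Keywords are upper-cased only for the table lookup; identifier bytes pass through untouched, which is what keeps `uSeRs` distinct from `users` in the serialized output.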

Wire format

Magic header "DSESQL01" (8 ASCII bytes), then u32 LE statement count, then that many statement records:

Statement record:
  u8 kind           1=Create, 2=Insert, 3=Select, 4=Delete, 5=Update
  u32 LE name_len
  name bytes

  if Create:
    u32 LE col_count
    repeat col_count: { u32 LE name_len, name bytes, u8 type (1=Int|2=Text) }

  if Insert:
    u32 LE row_count
    repeat row_count:
      u32 LE col_count
      repeat col_count: literal

  if Select:
    u8 cols_kind     1 = *, 0 = named
    if named: u32 LE n, repeat n: { u32 LE name_len, name bytes }
    where

  if Delete:
    where

  if Update:
    u32 LE set_count
    repeat set_count: { u32 LE name_len, name bytes, literal }
    where

literal:
  u8 tag            1 = Int, 2 = Text
  if Int:  i64 LE (two's-complement, little-endian)
  if Text: u32 LE n, n bytes

where:
  u8 has_where      0 = no WHERE, 1 = WHERE
  if 1:
    u32 LE col_name_len, col_name bytes
    u8 op           1=Eq, 2=Ne, 3=Lt, 4=Le, 5=Gt, 6=Ge
    literal

Every integer is unsigned-little-endian unless noted. Strings carry their own length prefix (no null-terminators, no escapes — the bytes are exactly what the parser saw between the unescaped quotes).
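As a sketch of the `literal` and `where` records above, the encoder below pushes tags and little-endian integers into a `Vec<u8>`. The type and function names (`Literal`, `Where`, `put_literal`, `put_where`) are illustrative assumptions and may not match the lab's actual signatures.

```rust
enum Literal {
    Int(i64),
    Text(String),
}

/// Operator tags as listed in the wire format: Eq=1 .. Ge=6.
#[derive(Clone, Copy)]
#[repr(u8)]
enum Op {
    Eq = 1,
    Ne = 2,
    Lt = 3,
    Le = 4,
    Gt = 5,
    Ge = 6,
}

struct Where {
    col: String,
    op: Op,
    lit: Literal,
}

fn put_u32(out: &mut Vec<u8>, n: u32) {
    out.extend_from_slice(&n.to_le_bytes());
}

/// Length-prefixed bytes: u32 LE length, then the raw bytes.
fn put_str(out: &mut Vec<u8>, s: &str) {
    put_u32(out, s.len() as u32);
    out.extend_from_slice(s.as_bytes());
}

/// literal: u8 tag, then i64 LE (Int) or a length-prefixed string (Text).
fn put_literal(out: &mut Vec<u8>, lit: &Literal) {
    match lit {
        Literal::Int(v) => {
            out.push(1);
            out.extend_from_slice(&v.to_le_bytes());
        }
        Literal::Text(s) => {
            out.push(2);
            put_str(out, s);
        }
    }
}

/// where: u8 has_where, then column name, operator tag, and literal if present.
fn put_where(out: &mut Vec<u8>, w: Option<&Where>) {
    match w {
        None => out.push(0),
        Some(w) => {
            out.push(1);
            put_str(out, &w.col);
            out.push(w.op as u8);
            put_literal(out, &w.lit);
        }
    }
}
```

Under this sketch `WHERE id = 2` encodes to 17 bytes: 1 (has_where) + 4+2 (column) + 1 (op) + 1+8 (literal).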

Error format

Every error message is one line of the form:

parse error at line L col C: <message>

Lines and columns are 1-based. tokenize errors report the position of the bad character; parse errors report the position of the offending token (or one past the last token's column if the input ended early).
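One way to derive the 1-based position from a byte offset is sketched below. The helper names are hypothetical, and counting columns in characters rather than bytes is an assumption; the text above does not say which the implementations use, and the two only differ on non-ASCII input.

```rust
/// 1-based (line, col) of the character that starts `offset` bytes into `src`.
/// Assumption: columns count Unicode scalar values, not bytes.
fn line_col(src: &str, offset: usize) -> (u32, u32) {
    let (mut line, mut col) = (1u32, 1u32);
    for (i, ch) in src.char_indices() {
        if i >= offset {
            break;
        }
        if ch == '\n' {
            line += 1;
            col = 1;
        } else {
            col += 1;
        }
    }
    (line, col)
}

/// The one-line error format shared by tokenizer and parser errors.
fn format_error(line: u32, col: u32, msg: &str) -> String {
    format!("parse error at line {line} col {col}: {msg}")
}
```

For `SELECT FROM t;` the offending token `FROM` starts at byte offset 7, which maps to line 1, column 8.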

What's intentionally out of scope

  • Execution. No VM, no query plan, no I/O. db-13.
  • JOIN, GROUP BY, ORDER BY, LIMIT, expressions. Single-table predicates only. Future labs.
  • Schema validation. A SELECT name FROM t referencing an undefined column parses cheerfully; that's the planner's job, not the parser's.
  • Identifier case folding. SQLite folds Users and users together; Postgres folds them to lowercase. We do neither — identifiers are preserved verbatim, only keywords are case-insensitive. This makes the byte-identity test sharper.
  • Quoted identifiers ("foo"), backticks, square brackets. One identifier syntax keeps the tokenizer trivial.
  • Negative literals as expressions. A leading - before an integer literal in a value position is parsed as a sign on that literal; it is not a unary-minus operator. There is no general expression grammar.

Cross-language invariant

All three implementations expose sqlctl parse --file FOO.sql (or --inline "..."). Stdout receives the canonical bytes; stderr receives the sha256 hex (no trailing newline). scripts/cross_test.sh runs both fixtures through all three binaries and asserts:

  • All three stderr-emitted sha256s match.
  • The matching hash equals the frozen-in-tests value (so the wire format cannot silently drift even if all three implementations drift together).
  • The bytes themselves are bit-identical (cmp -s) — guarding against a hypothetical sha256 collision.
  • The error path also matches: feeding "SELECT FROM t;" to all three binaries must produce a non-zero exit and an error line that mentions the column.

The frozen reference hashes are:

Fixture      Bytes  sha256
a_basic.sql  181    071b40fd5d0c684695c5a8499be6fe970ed4533af16f71dcc4c455091b576d15
b_full.sql   486    e219f1ee4ae69f194cca7b9791aa2e34ecdb2680956dbf8a94618fa8093aa962

Any change to the AST shape, tokenizer behavior, or wire format must update those numbers in scripts/cross_test.sh, the Go test (sql_test.go), the C++ test (tests/test_sqlfront12.cc), and this table — all in the same commit.

db-12 — References

Primary sources

  • Crafting Interpreters, Robert Nystrom. Chapters 4 ("Scanning") and 6 ("Parsing Expressions") map almost 1:1 onto what we built. The hand-rolled recursive-descent style and the "one function per non-terminal" discipline are taken straight from this book. https://craftinginterpreters.com/
  • Modern Compiler Implementation in ML (or in C / Java), Andrew Appel — chapter 3 ("Parsing"). The clearest exposition of why recursive descent works for LL(1) grammars and what changes when you need lookahead or precedence climbing.
  • The Dragon Book (Aho, Lam, Sethi, Ullman) — chapters 3 and 4. The textbook source for lexical analysis (DFA construction, regular expressions to scanners) and predictive parsing. Read for theory; the practice is in Crafting Interpreters.

How real databases parse SQL

Recursive descent specifically

  • Rob Pike, Lexical Scanning in Go, Google Technology User Group talk, 2011. The talk that popularized the "scanner emits tokens to a channel; parser reads from the channel" style. We don't use channels (we just return a Vec), but the state-machine framing of the scanner loop is the same. https://www.youtube.com/watch?v=HxaD_trXwRE
  • Doug Crockford, Top Down Operator Precedence, 2007 — the cleanest explanation of Pratt parsing, which is what you reach for next once recursive descent runs into expression-precedence pain. We don't need it here (no expression grammar) but it's the natural follow-up. https://crockford.com/javascript/tdop/tdop.html

Determinism / wire formats

Cross-lab dependencies

None. db-12 is intentionally self-contained: there is nothing in earlier labs (storage, WAL, LSM, B-tree) that the parser needs, and the AST serializer uses no upstream wire format. The C++ build does not add_subdirectory(../db-NN). That isolation is a feature, not an oversight — it keeps the lab small enough to be reasoned about as a self-contained compiler-front exercise.

db-12 feeds db-13 (execution + transactions), where the AST will finally be walked by a VM.

db-12 — Analysis

We are building a hand-written SQL frontend in three languages and proving agreement byte-for-byte. The hard part is not any single piece — tokenizing, parsing, and emitting bytes are all small, well-understood components — but holding all three implementations to a single set of design decisions tight enough that the output hashes match on every input.

Required invariants

  1. Deterministic encoding. Given the same input text, the serializer must produce exactly the same byte sequence on every run, in every language, on every machine. No iteration over hash-maps, no environment-dependent integer widths, no locale-sensitive case conversion. Iteration order of set / cols is insertion order (which is parse order, which is source order).
  2. Error reporting carries 1-based (line, col). Tokenizer errors point at the offending character; parser errors point at the offending token. The error string format is identical across languages (a cross_test.sh smoke test asserts this on SELECT FROM t;).
  3. Identifiers are preserved verbatim. Only keywords are case-insensitive. select * FROM uSeRs is legal; the table identifier uSeRs is emitted as the bytes u, S, e, R, s, not users and not USERS.
  4. String literals use SQL escape: doubled single-quote = one single-quote. 'pad''let' is the 7-byte string pad'let. No backslash escapes; no E'...' C-style escapes. The serializer emits the unescaped string contents.
  5. All five statement kinds round-trip identically. No statement is "almost canonical" — every parsable input produces a byte-identical serialization to the same input parsed by the other two languages.
  6. Cross-language byte identity is the only acceptable proof. Equal AST shapes "by inspection" don't count; equal sha256 over the serialized bytes does.

Design decisions

Why a u8 tag in front of every variant

The wire format is a tagged union. Every statement, every literal, every WHERE-or-no-WHERE choice starts with a single byte that tells you what follows. The alternatives all fail:

  • Implicit type from position: requires a schema, which the frontend has no access to.
  • Self-describing JSON-like format: kills byte identity (key ordering, whitespace, escape choices).
  • Protobuf-style varints: introduces "unknown field" / "default value" ambiguities. Two encoders that agree on the schema can still disagree on the bytes.

A fixed u8 tag with a tight numeric assignment (Create=1, Insert=2, Select=3, Delete=4, Update=5) plus length-prefixed strings gives us trivially-determinizable bytes.

Why INT is i64 LE, not varint

i64 LE is the simplest thing that works in all three languages without a helper library. Varint would save a few bytes on small literals but costs a non-trivial encoder/decoder that we'd have to keep in lockstep across Rust/Go/C++.

Why operators get a single byte, not a string

Same reason: a fixed numeric assignment (Eq=1..Ge=6) makes the byte layout exact and language-agnostic. If we wrote "=", then someone in some language would eventually decide to emit "==" and the hashes would drift on the day the lab grew expressions.

Why we keep the MAGIC header

"DSESQL01" is 8 bytes of self-description. It costs nothing, lets a hypothetical decoder detect "this isn't a db-12 AST blob" before mis-parsing, and pins the wire format version inside the bytes themselves (01). If the format ever changes incompatibly, we bump to DSESQL02 and update the frozen hashes.

Why we don't compute the AST length up front

A length prefix on each statement would force a two-pass serialize (size then write), or backpatching. We get away without it because the wire format is fully self-describing left-to-right; a decoder needs no random access. Keeping the encoder one-pass keeps all three implementations short and obviously equivalent.
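The "no random access" claim can be illustrated with a forward-only reader. This is a hypothetical decoder sketch (the lab ships only an encoder); `Reader` and `read_prefix` are names invented here. It checks the magic, reads the statement count, then reads the kind tag and table name of the first statement, never seeking backwards.

```rust
/// Forward-only cursor over an encoded blob.
struct Reader<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> Reader<'a> {
    /// Consume the next `n` bytes or fail if the blob is too short.
    fn take(&mut self, n: usize) -> Result<&'a [u8], String> {
        let buf: &'a [u8] = self.buf;
        let end = self
            .pos
            .checked_add(n)
            .filter(|&e| e <= buf.len())
            .ok_or_else(|| format!("truncated at byte {}", self.pos))?;
        let s = &buf[self.pos..end];
        self.pos = end;
        Ok(s)
    }

    fn read_u8(&mut self) -> Result<u8, String> {
        Ok(self.take(1)?[0])
    }

    fn read_u32(&mut self) -> Result<u32, String> {
        let b = self.take(4)?;
        Ok(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
    }

    /// u32 LE length followed by that many bytes.
    fn read_string(&mut self) -> Result<String, String> {
        let n = self.read_u32()? as usize;
        String::from_utf8(self.take(n)?.to_vec()).map_err(|e| e.to_string())
    }
}

/// Returns (statement count, kind of first statement, its table name).
fn read_prefix(blob: &[u8]) -> Result<(u32, u8, String), String> {
    let mut r = Reader { buf: blob, pos: 0 };
    if r.take(8)? != b"DSESQL01" {
        return Err("bad magic".to_string());
    }
    let count = r.read_u32()?;
    let kind = r.read_u8()?;
    let name = r.read_string()?;
    Ok((count, kind, name))
}
```

Because every variable-length field carries its own length or count, the reader always knows how many bytes to consume next, which is why a per-statement length prefix would add nothing for a sequential decoder.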

Why the C++ build is self-contained

db-12 has no upstream lab dependencies. The C++ CMakeLists.txt does not add_subdirectory(../db-NN). That keeps the lab's ctest output clean (only one test target: test_sqlfront12) and avoids the trap from db-09 where leaking upstream add_test calls polluted local runs. Each lab's CMake should ask itself: do I genuinely need upstream code in this binary? For db-12, the answer is no.

Why deferring execution is the right call

A VM that walks the AST is the natural next step, but it needs a storage backend (db-10/11 pager or db-05/06 LSM), a notion of column types and rows, and ideally a transaction layer. Bolting any of that into db-12 would either bind it to a specific storage shape too early or build a toy in-memory engine we'd throw away in db-13. Stopping at AST bytes keeps the lab small, scope-clean, and shippable.

Why three languages

The same reason as every lab from db-01 onward: the only honest way to prove that independent implementations of a binary protocol agree is to compare their output byte-for-byte, here via sha256. With three independent implementations all matching the same frozen reference hash, the probability that a bug in just one of them still produces a matching sha256 is vanishingly small. A matching hash on a non-trivial fixture is therefore near-proof that the entire tokenize → parse → encode pipeline agrees across languages. It does not rule out a mistake shared by all three (the reference hashes were computed from the Rust implementation), which is what the per-language unit tests are for.

db-12 — Execution

What we built, in the order we built it.

1. Rust (src/rust) — the reference

  • Cargo.toml declares crate sqlfront12 (lib) and a binary sqlctl. No external dependencies, no path dependencies — the lab is self-contained.
  • src/lib.rs (~1100 lines) defines:
    • ParseError (one error type for both tokenize and parse phases).
    • TokKind + tokenize(src) -> Result<Vec<Token>, ParseError>. The tokenizer is a single character-by-character loop with branches for whitespace, -- line comment, identifier/keyword, integer literal, '...' string literal (with '' escape), comparison operator (=, !=, <, <=, >, >=), and single-char punctuation ((, ), ,, ;).
    • ColType { Int, Text }, Literal { Int(i64), Text(String) }, Op { Eq=1, Ne=2, Lt=3, Le=4, Gt=5, Ge=6 } (#[repr(u8)]), Where, SelectCols { Star, Named(Vec<String>) }, Statement enum with five variants.
    • Parser struct (token slice + cursor) with one method per non-terminal (parse_program, parse_stmt, parse_create, parse_insert, parse_select, parse_delete, parse_update, parse_where, parse_literal).
    • parse(src) -> Result<Vec<Statement>, ParseError> glues tokenize + Parser together.
    • serialize(stmts) -> Vec<u8> walks the AST and emits the canonical bytes described in CONCEPTS.md. Magic header b"DSESQL01" then u32 LE count, then per-statement records.
    • Inline sha256 + sha256_hex (FIPS 180-4) so the lab has no external crate dependencies.
  • 11 inline #[cfg(test)] tests:
    1. tokenize_happy: full coverage of all token kinds on a single mixed input.
    2. tokenize_strings_and_errors: '' escape; unterminated string reports correct (line, col).
    3. parse_create_table: CREATE TABLE with INT + TEXT columns.
    4. parse_insert_multirow: multi-row VALUES, both literal types.
    5. parse_select_variants_and_all_ops: SELECT *, SELECT col, col, each of the 6 comparison ops.
    6. parse_update_and_delete: UPDATE multi-SET + WHERE; DELETE + WHERE.
    7. parse_multi_with_comments_and_case: -- line comments, case-insensitive keywords, identifier case preserved.
    8. parse_errors_report_column: missing identifier after SELECT reports line 1 col 8.
    9. serialize_header_and_count: magic bytes + count field correct.
    10. serialize_is_deterministic: two serialize calls on the same AST return equal bytes.
    11. sha256_known_vectors: the FIPS-180-4 SHA-256("") and ("abc") vectors.
  • bin/sqlctl.rs is the CLI used by the cross-language script.
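The "one method per non-terminal" shape of the Parser can be sketched on a cut-down grammar. The block below is illustrative only: it takes pre-built tokens, covers just `SELECT (* | ident, ...) FROM ident ;`, and uses a simplified `Tok` enum and a `tok` helper that are assumptions of this sketch. The end-of-input rule (last token's start column plus one) is also one reading of the error-format description, not a confirmed detail.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    Kw(&'static str),
    Ident(String),
    Star,
    Comma,
    Semi,
}

#[derive(Debug, Clone)]
struct Token {
    tok: Tok,
    line: u32,
    col: u32,
}

#[derive(Debug, PartialEq)]
enum SelectCols {
    Star,
    Named(Vec<String>),
}

#[derive(Debug, PartialEq)]
struct Select {
    cols: SelectCols,
    table: String,
}

/// Token slice plus cursor.
struct Parser<'a> {
    toks: &'a [Token],
    pos: usize,
}

impl<'a> Parser<'a> {
    fn peek(&self) -> Option<&'a Token> {
        self.toks.get(self.pos)
    }

    /// Error at the offending token; at end of input, assume last column + 1.
    fn error(&self, msg: &str) -> String {
        let (line, col) = match (self.peek(), self.toks.last()) {
            (Some(t), _) => (t.line, t.col),
            (None, Some(last)) => (last.line, last.col + 1),
            (None, None) => (1, 1),
        };
        format!("parse error at line {line} col {col}: {msg}")
    }

    fn expect(&mut self, want: Tok, what: &str) -> Result<(), String> {
        match self.peek() {
            Some(t) if t.tok == want => { self.pos += 1; Ok(()) }
            _ => Err(self.error(&format!("expected {what}"))),
        }
    }

    fn ident(&mut self) -> Result<String, String> {
        match self.peek() {
            Some(Token { tok: Tok::Ident(name), .. }) => { self.pos += 1; Ok(name.clone()) }
            _ => Err(self.error("expected identifier")),
        }
    }

    /// select := SELECT ( '*' | ident (',' ident)* ) FROM ident ';'
    fn parse_select(&mut self) -> Result<Select, String> {
        self.expect(Tok::Kw("SELECT"), "SELECT")?;
        let cols = if matches!(self.peek(), Some(Token { tok: Tok::Star, .. })) {
            self.pos += 1;
            SelectCols::Star
        } else {
            let mut names = vec![self.ident()?];
            while matches!(self.peek(), Some(Token { tok: Tok::Comma, .. })) {
                self.pos += 1;
                names.push(self.ident()?);
            }
            SelectCols::Named(names)
        };
        self.expect(Tok::Kw("FROM"), "FROM")?;
        let table = self.ident()?;
        self.expect(Tok::Semi, "';'")?;
        Ok(Select { cols, table })
    }
}

/// Hypothetical test helper: a token on line 1 at the given column.
fn tok(t: Tok, col: u32) -> Token {
    Token { tok: t, line: 1, col }
}
```

Feeding it the tokens of `SELECT FROM t;` fails inside `ident()` with the position of `FROM`, which is how the column-8 error in test 8 arises.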

2. Fixtures (scripts/fixtures)

Two SQL files, frozen forever: the Go and C++ tests inline their exact text (including the trailing newline and the en-dashes in the comment lines), and the frozen hashes depend on every token they contain.

  • a_basic.sql — minimal smoke test. CREATE TABLE users, three-row INSERT, SELECT *, SELECT id, name WHERE id = 2. 181 bytes serialized.
  • b_full.sql — full-coverage. Every statement kind, both literal types, the '' escape, every comparison operator. 486 bytes serialized.

The hashes were computed once from the Rust reference and then frozen into the Go test, the C++ test, and scripts/cross_test.sh. If you edit either fixture, all three of those locations must update in the same commit.

3. Go (src/go)

  • go.mod module github.com/10xdev/dse/db12. No external deps, no replace directives — the module stands alone.
  • sql.go ports the Rust types one-for-one:
    • TokKind int constants.
    • Token, ColType (ColInt=1, ColText=2), LitKind, Literal, Op (OpEq=1..OpGe=6), Where, SelectColsKind, SelectCols, Column, Assign, StmtKind (KindCreate=1..KindUpdate=5).
    • One Statement struct holds the union (kind tag + every variant's fields). Go has no sum types, so this is the idiomatic shape.
    • Tokenize, Parse, Serialize exported; an internal parser struct mirrors Rust's Parser.
  • sql_test.go mirrors all 11 Rust tests. Two of them, TestFixtureAHash and TestFixtureBHash, inline the exact fixture text and assert both the byte length and the frozen sha256. These two tests are what lock the wire format permanently.
  • cmd/sqlctl/main.go is the matching CLI.

Go matched Rust byte-for-byte on first run; no debugging needed.

4. C++ (src/cpp)

  • CMakeLists.txt — self-contained. Targets sqlfront12_lib, sqlctl, test_sqlfront12. No add_subdirectory because db-12 has no upstream dependencies; a comment in the file explains why not, so future-me doesn't try to "wire it up like db-09".

  • src/sqlfront12.h declares namespace sqlfront12: ParseError : std::runtime_error, TokKind, Token, the AST types, and entry points tokenize, parse, serialize, sha256_hex.

  • src/sqlfront12.cc (~500 lines) implements them. Anonymous-namespace Parser class; std::vector<std::uint8_t> buffers for the serializer; inline SHA-256 with a hex lookup table.

  • src/sqlctl.cc — the C++ CLI mirror. Writes bytes to stdout via std::cout.write(...), sha256 hex to stderr, catches ParseError and anything else, prints message, returns 1.

  • tests/test_sqlfront12.cc — 11 tests, mirroring Rust + Go. The first line is

    #undef NDEBUG
    #include <cassert>
    

    because Release builds define NDEBUG, which turns assert into a no-op. Two of the tests inline the fixture content (including the en-dashes, as UTF-8 in a C++ raw string literal) and assert the frozen hashes.

C++ matched Rust and Go on first build; ctest passed in ~0.2s.

5. Scripts (scripts/)

  • verify.sh runs cargo test + go test + cmake/ctest, prints === OK ===, and exits 0.

  • cross_test.sh — builds the three sqlctl binaries, runs each against both fixtures, asserts:

    • all three stderr-emitted sha256s match each other and the frozen value, for each fixture;
    • the CLI-emitted sha256 equals shasum -a 256 of the stdout bytes (catches "CLI lies about its own hash" bugs);
    • the byte streams are bit-identical (cmp -s);
    • an inline-arg smoke test (sqlctl parse --inline 'SELECT * FROM t;') matches across the three languages;
    • an error-path smoke test (SELECT FROM t;) returns non-zero in all three and the error string mentions the column.

    Prints === ALL OK === on success.

6. Bash 3.2 portability

macOS ships bash 3.2, which lacks declare -A (associative arrays). The first cut of cross_test.sh used declare -A WANT; WANT[a.sql]=...; want="${WANT[$fix]}", which ran fine under brew's bash 5.x and broke under /bin/bash. The fix is a plain function:

want_hash() {
    case "$1" in
        a_basic.sql) echo "071b40fd..." ;;
        b_full.sql)  echo "e219f1ee..." ;;
        *) echo ""; return 1 ;;
    esac
}
...
want="$(want_hash "$fix")"

Both scripts now run cleanly under /bin/bash (verified end-to-end).

What we deliberately didn't build

  • A bytecode VM. db-13.
  • A query planner. db-13/14.
  • Expressions richer than col OP literal. Future labs once we have a use for them.
  • Schema validation, name resolution, type checking. All planner jobs.
  • A pretty-printer / unparse function. Useful for round-trip fuzzing, irrelevant to the byte-identity proof.

db-12 — Observation

What the cross-language verification actually proves.

Output of scripts/cross_test.sh

=== build ===
=== fixture: a_basic.sql ===
  rust=071b40fd5d0c684695c5a8499be6fe970ed4533af16f71dcc4c455091b576d15 (     181 B)
  go  =071b40fd5d0c684695c5a8499be6fe970ed4533af16f71dcc4c455091b576d15 (     181 B)
  cpp =071b40fd5d0c684695c5a8499be6fe970ed4533af16f71dcc4c455091b576d15 (     181 B)
  match: 071b40fd5d0c684695c5a8499be6fe970ed4533af16f71dcc4c455091b576d15
=== fixture: b_full.sql ===
  rust=e219f1ee4ae69f194cca7b9791aa2e34ecdb2680956dbf8a94618fa8093aa962 (     486 B)
  go  =e219f1ee4ae69f194cca7b9791aa2e34ecdb2680956dbf8a94618fa8093aa962 (     486 B)
  cpp =e219f1ee4ae69f194cca7b9791aa2e34ecdb2680956dbf8a94618fa8093aa962 (     486 B)
  match: e219f1ee4ae69f194cca7b9791aa2e34ecdb2680956dbf8a94618fa8093aa962
=== inline-arg smoke test ===
  inline hash: 941f21252cdf88816e720c0e6877f3728eac3390355d0eb5a69febccbf470991
=== error-path smoke test ===
  [rust] parse error at line 1 col 8: expected identifier
  [go] parse error at line 1 col 8: expected identifier
  [cpp] parse error at line 1 col 8: expected identifier
=== ALL OK ===

Where 181 bytes for a_basic.sql comes from

a_basic.sql parses to four statements: a CREATE TABLE, an INSERT with three rows, a SELECT *, and a SELECT id, name WHERE id = 2. The serialized bytes break down as:

Header                                        12 B
  magic "DSESQL01"                                 8
  u32 LE stmt_count = 4                            4

CREATE TABLE users (id INT, name TEXT)        30 B
  u8 kind=1                                        1
  u32 name_len=5 + "users"                       4+5
  u32 col_count=2                                  4
  col "id":   u32 len=2 + bytes + u8 type=1      4+2+1
  col "name": u32 len=4 + bytes + u8 type=2      4+4+1
                                                ----
                                                  30

INSERT INTO users VALUES (1,'a'), (2,'b'), (3,'c')   71 B
  u8 kind=2                                        1
  u32 name_len=5 + "users"                       4+5
  u32 row_count=3                                  4
  per row (×3):
    u32 col_count=2                                  4
    lit Int(N):  u8 tag=1 + i64 LE                 1+8
    lit Text(c): u8 tag=2 + u32 len=1 + 1 byte   1+4+1
        per-row total = 4 + 9 + 6 = 19
  3 rows × 19                                     57
                                                ----
                                                  71

SELECT * FROM users                           12 B
  u8 kind=3                                        1
  u32 name_len=5 + "users"                       4+5
  u8 cols_kind=1 (Star)                            1
  u8 has_where=0                                   1
  (no SELECT-cols list when Star, no WHERE)
                                                ----
                                                  12

SELECT id, name FROM users WHERE id = 2       46 B
  u8 kind=3                                        1
  u32 name_len=5 + "users"                       4+5
  u8 cols_kind=0 (Named)                           1
  u32 named_count=2                                4
  col "id":   u32 len=2 + bytes                  4+2
  col "name": u32 len=4 + bytes                  4+4
  u8 has_where=1                                   1
  u32 col_len=2 + "id"                           4+2
  u8 op=1 (Eq)                                     1
  lit Int(2): u8 tag=1 + i64 LE                  1+8
                                                ----
                                                  46

Total = 12 (header) + 30 + 71 + 12 + 46       = 171 B

The arithmetic above lands at 171 B, not the observed 181 B. Every record follows the wire format field by field, so the 10 B gap means one of two things: either the real fixture differs from the statements assumed here (the walk assumes one-byte text values 'a', 'b', 'c', and every extra byte in a text literal or identifier adds one byte to the total), or the implementations emit something the format description omits. Either way, the observed 181 B matches across Rust, Go, and C++ on every platform we've run them on, which is the claim that matters here. The fact that all three independent implementations agree on both the byte count and the sha256 is what makes the result trustworthy; the per-statement byte arithmetic is a sanity check to build intuition, not a constraint.

(If you want the exact breakdown, hexdump the file written by sqlctl parse --file scripts/fixtures/a_basic.sql > /tmp/a.bin; xxd /tmp/a.bin and read it linearly against the wire format in CONCEPTS.md.)
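The per-statement sizes can also be checked mechanically. The sketch below encodes the four statements as the hand-walk assumes them (table `users`, one-byte text values 'a', 'b', 'c') following the wire format in CONCEPTS.md, and returns the length of each record. The function names are illustrative, and the statement contents are the hand-walk's assumptions, not the verified contents of a_basic.sql.

```rust
enum Lit {
    Int(i64),
    Text(&'static str),
}

fn put_u32(out: &mut Vec<u8>, n: usize) {
    out.extend_from_slice(&(n as u32).to_le_bytes());
}

fn put_str(out: &mut Vec<u8>, s: &str) {
    put_u32(out, s.len());
    out.extend_from_slice(s.as_bytes());
}

fn put_lit(out: &mut Vec<u8>, lit: &Lit) {
    match lit {
        Lit::Int(v) => { out.push(1); out.extend_from_slice(&v.to_le_bytes()); }
        Lit::Text(s) => { out.push(2); put_str(out, s); }
    }
}

/// kind=1: table name, col_count, then (name, type tag) per column.
fn create(table: &str, cols: &[(&str, u8)]) -> Vec<u8> {
    let mut out = vec![1u8];
    put_str(&mut out, table);
    put_u32(&mut out, cols.len());
    for (name, ty) in cols {
        put_str(&mut out, name);
        out.push(*ty);
    }
    out
}

/// kind=2: table name, row_count, then col_count + literals per row.
fn insert(table: &str, rows: &[Vec<Lit>]) -> Vec<u8> {
    let mut out = vec![2u8];
    put_str(&mut out, table);
    put_u32(&mut out, rows.len());
    for row in rows {
        put_u32(&mut out, row.len());
        for lit in row {
            put_lit(&mut out, lit);
        }
    }
    out
}

/// kind=3: table name, cols_kind (1 = *, 0 = named list), then the where record.
fn select(table: &str, cols: Option<&[&str]>, wh: Option<(&str, u8, Lit)>) -> Vec<u8> {
    let mut out = vec![3u8];
    put_str(&mut out, table);
    match cols {
        None => out.push(1),
        Some(names) => {
            out.push(0);
            put_u32(&mut out, names.len());
            for n in names {
                put_str(&mut out, n);
            }
        }
    }
    match wh {
        None => out.push(0),
        Some((col, op, lit)) => {
            out.push(1);
            put_str(&mut out, col);
            out.push(op);
            put_lit(&mut out, &lit);
        }
    }
    out
}

/// Sizes of the header and the four statements assumed by the hand-walk.
fn a_basic_sizes() -> Vec<usize> {
    let header = 8 + 4; // magic + u32 statement count
    let rows = vec![
        vec![Lit::Int(1), Lit::Text("a")],
        vec![Lit::Int(2), Lit::Text("b")],
        vec![Lit::Int(3), Lit::Text("c")],
    ];
    vec![
        header,
        create("users", &[("id", 1), ("name", 2)]).len(),
        insert("users", &rows).len(),
        select("users", None, None).len(),
        select("users", Some(&["id", "name"][..]), Some(("id", 1, Lit::Int(2)))).len(),
    ]
}
```

Under these assumptions the sizes come out as 12, 30, 71, 12, and 46, for a total of 171 B. The observed 181 B then implies the real fixture is not exactly these statements (longer text values would be one explanation), which the hexdump suggested above would settle.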

What b_full.sql adds

  • All five statement kinds, including the ones a_basic.sql omits (DELETE, UPDATE).
  • Both literal kinds (Int and Text) in every position they can appear.
  • The '' escape inside a TEXT literal.
  • Every comparison operator in WHERE clauses (=, !=, <, <=, >, >=).

486 bytes, hash e219f1ee....

What this proves

  1. Tokenizers agree. Otherwise the token stream into the parser would differ and the AST would diverge.
  2. Parsers agree on grammar interpretation. Otherwise the AST shapes would differ — different statement kinds, different WHERE absence/presence, different operator assignment.
  3. AST type tags agree. A flipped Le / Lt (the canonical off-by-one) shows up as one wrong byte and a fully different hash.
  4. Literal encoding agrees. Integer endianness, string length-prefix vs null-termination, the '' escape semantics — all covered.
  5. The keyword set is identical across the three languages. Adding LIMIT to one tokenizer's reserved-word table without the others would cause the next fixture using limit as an identifier to break.
  6. Error-path behavior agrees. The error-line format parse error at line L col C: <msg> is identical, and the column number for SELECT FROM t; is 8 in all three. Different column-counting conventions would show up here.

Any single bug in any of those layers, in any one language, would break the hash match as long as the fixtures exercise it. A match is therefore very strong evidence that the three frontends agree end-to-end, and behave as the frozen reference does, on these inputs.

What scripts/verify.sh adds

verify.sh does not exercise cross-language identity — it just runs the per-language unit tests. The Go and C++ test suites each include the two frozen-hash tests, so even without cross_test.sh a Go-only or C++-only test run would catch a wire-format drift in that language. cross_test.sh is the belt-and-suspenders check that all three actually agree on the same input file (rather than three languages agreeing with three different bug-compatible copies of the fixtures).

db-12 — Verification

How to reproduce the green status on a clean machine.

Prerequisites

  • macOS or Linux with Apple Clang / clang ≥ 14 / gcc ≥ 11 supporting C++20.
  • cmake ≥ 3.20.
  • Rust toolchain ≥ 1.74 (rustup default stable).
  • Go ≥ 1.22.
  • shasum, cmp, awk (default on macOS; on Linux, shasum ships with Perl, cmp with diffutils, and awk with gawk or mawk).
  • bash — the scripts are written to bash 3.2 (what macOS ships) on purpose, so /bin/bash works; bash 5.x is fine too.

No network access required. No external crates, modules, or libraries.

One command

cd db-12-sql-frontend
scripts/verify.sh        # builds + unit tests in all three langs
scripts/cross_test.sh    # cross-language sha256 match against fixtures

verify.sh should print === OK === and cross_test.sh should print === ALL OK ===; both exit 0.

Per-language drill-down

Rust

cd db-12-sql-frontend/src/rust
cargo test --quiet
cargo build --release

Expected: 11 passed; 0 failed. The sqlctl binary lands in target/release/sqlctl.

Go

cd db-12-sql-frontend/src/go
go test ./...
go build ./cmd/sqlctl

Expected: ok github.com/10xdev/dse/db12 <duration>. Eleven tests pass, including TestFixtureAHash and TestFixtureBHash, which assert the frozen sha256 values of the 181-byte and 486-byte serialized fixtures.

C++

cd db-12-sql-frontend/src/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
ctest --test-dir build --output-on-failure

Expected: 100% tests passed, 0 tests failed out of 1 and test_sqlfront12 prints OK. The single ctest target runs all 11 inline assertions, including the two frozen-hash fixture tests.

What "green" means

A green run guarantees:

  • All 33 unit tests pass (11 each in Rust, Go, C++).

  • The Rust serializer, the Go serializer, and the C++ serializer all agree on the frozen reference hashes:

    Fixture      Bytes  sha256
    a_basic.sql  181    071b40fd5d0c684695c5a8499be6fe970ed4533af16f71dcc4c455091b576d15
    b_full.sql   486    e219f1ee4ae69f194cca7b9791aa2e34ecdb2680956dbf8a94618fa8093aa962

  • The CLIs in all three languages report the same sha256 as shasum -a 256 over their stdout — they aren't lying about their own hash.

  • The error path is also identical: the three implementations all report parse error at line 1 col 8: expected identifier for the malformed input SELECT FROM t;.

When verification fails

  • Cross-language sha256 mismatch on a fixture. The wire format drifted in exactly one language. Things to suspect, in order of likelihood:

    1. New operator added to Op in one language only — emits a new tag byte the others don't recognize.
    2. String length prefix width changed (u32 → u64).
    3. Endianness slip on i64 LE (someone used binary.BigEndian in Go, or htonl in C++).
    4. Iteration order divergence — most likely on SELECT named-column lists or UPDATE SET assignments. Both must follow parse order (insertion order); a HashMap somewhere would break this.
  • Cross-language sha256 match but mismatch against the frozen value. All three implementations drifted together — the wire format genuinely changed. Update the frozen hashes in scripts/cross_test.sh, src/go/sql_test.go, and src/cpp/tests/test_sqlfront12.cc in the same commit, and update the table in CONCEPTS.md.

  • Rust passes, Go fails frozen-hash test. Most likely an encoding/binary byte-order slip (BigEndian vs LittleEndian) or a missing i64/int64 conversion on a literal.

  • Rust + Go pass, C++ ctest reports zero tests. Confirm the add_test line is in CMakeLists.txt after the add_executable for test_sqlfront12. Do not add add_subdirectory(../db-NN) — db-12 has no upstream lab dependencies, and any such call leaks upstream add_test calls into our ctest output.

  • cross_test.sh fails with WANT: command not found or bad array subscript. The script is calling associative-array syntax under bash 3.2. The shipped cross_test.sh uses a want_hash() case function precisely to avoid this; if the failure recurs after an edit, search for declare -A or ${WANT[ and replace with the function form.

  • Fixture hash changes after editing a .sql file. Any edit that reaches a token (different identifier bytes, different literal contents, a reordered statement) changes the AST and therefore the output hash. Whitespace-only or comment-only edits, such as a trailing newline or swapping the en-dash in a comment line for a hyphen, should be discarded by the tokenizer; if one of those moves the hash, suspect whitespace or comment handling in that language. Either way, the fixtures are frozen for the lifetime of the lab; if you want to add coverage, add a new fixture and a new frozen hash rather than editing these.

db-12 — Broader Ideas

Where to take this frontend next, and how the patterns generalize.

Immediate next labs

  • db-13 — Execution + transactions + MVCC. The bytecode VM that walks the AST we built here is the natural next step. db-13 needs:

    • A storage backend (we'll plug in the db-11 pager for B-tree-backed tables, or the db-09 LSM for log-structured tables).
    • A row representation (compact bytes per row).
    • A type checker that turns "the AST referenced column name" into "column index 1 of type TEXT".
    • A planner — even a trivial one — to convert SELECT … WHERE … into a scan-with-filter or an index lookup.
    • A transaction layer.

    Each of those is a small project. The reason db-12 stops where it does is so that db-13 can spend its budget on those, not on re-parsing.
  • db-14 — Indexes + query optimization. Once we have a planner, an obvious next move is to add secondary indexes and a cost-based picker between scan and index-lookup. The AST shape from db-12 is rich enough to drive that without modification.

How this lab's patterns show up in real systems

  • Recursive descent is what you actually read in most production database front-ends, even when the surface uses a parser generator. SQLite's parse.y (Lemon) generates a parser whose state machine looks nothing like recursive descent, but the hand-rolled tokenize.c is exactly the style we used, and expr.c (expression analysis and code generation) is written as recursive walks over the expression tree. Postgres is similar: gram.y is bison, but analyze.c and the planner do recursive walks over the AST that look just like our serializer.

  • AST → bytes is a primitive that quietly underlies a lot of database engineering:

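    • Query fingerprinting: Postgres's pg_stat_statements derives its query ID by hashing the post-parse-analysis tree, so queries that differ only in formatting or literal constants share one entry.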
    • Plan-hash matching: Oracle uses an AST/plan hash to decide "this query is the same as one I've seen before, reuse the plan."
    • Audit logs: serialize the AST instead of the raw SQL text so you can normalize whitespace, comments, and identifier case for diff-friendly storage.
    • Cross-version compatibility tests: serialize an AST in version N, deserialize in version N+1, and assert nothing changed — exactly the byte-identity discipline we use here, except across time instead of across languages.
  • Cross-language byte identity is rare in industry (most teams ship in one language) but the same discipline appears in:

    • Compiler bootstrapping: GCC's three-stage bootstrap rebuilds the compiler with itself and fails if the stage-2 and stage-3 object files differ; rustc's staged bootstrap follows the same self-rebuild pattern.
    • Deterministic builds: Bazel/Nix/Reproducible Builds project all rely on the same "bytes out are a pure function of bytes in" property we exercise here.
    • Cryptographic protocol implementations: TLS test vectors, canonical CBOR (RFC 8949 §4.2), Ed25519 deterministic signatures.

Performance experiments worth running later

These don't affect lab status (which is green), but they're good Saturday-afternoon explorations:

  • Replace the per-token String allocation in Rust with a slice into the source buffer (zero-copy tokens). Measure how much that buys on a 1 MB SQL script.
  • Profile the C++ serializer on a 100k-statement input. The hand-written push_back loop is probably memory-bandwidth-bound; a single reserve(estimate) up front should help.
  • Generate a 1k-fixture random-SQL fuzz corpus, parse it in all three languages, and assert sha256 match across languages on every input. This catches drift the two hand-written fixtures don't cover.
  • Pratt-parse expressions: add col + col, col * literal, etc., to the WHERE grammar using Pratt's top-down operator precedence. The AST gets a recursive Expr node; the serializer gets one more branch.

What "production-ready" would require beyond this lab

  • Lookahead beyond LL(1) in a handful of places (e.g., INSERT INTO t (col, col) VALUES ... vs INSERT INTO t VALUES ...).
  • A real expression grammar (Pratt or precedence-climbing).
  • JOIN, subqueries, CTEs, window functions, ORDER BY, GROUP BY, HAVING, LIMIT/OFFSET.
  • Quoted identifiers ("foo bar") and the associated escape semantics.
  • A separate semantic-analysis pass between parse and execute (name resolution, type checking, ambiguous-column detection).
  • Error recovery in the parser: real SQL frontends report multiple errors per parse rather than bailing on the first one.
  • Internationalized identifiers (Unicode identifier class, NFC normalization).
  • Concurrent parsing for prepared-statement caches (lock-free hash lookup, AST interning).

None of these change the shape of the front-end — they make the same shape bigger.

db-12 step 01 — Tokenizer

Goal

Implement tokenize(src) -> Result<Vec<Token>, ParseError> such that any character that cannot start a valid token produces an error pointing at its 1-based (line, col), and the legal token kinds form a stream the parser can consume left-to-right with one token of lookahead and no backtracking.

Tasks

  1. Define TokKind covering:
    • Keywords: SELECT, FROM, WHERE, INSERT, INTO, VALUES, CREATE, TABLE, DELETE, UPDATE, SET, AND, INT, TEXT.
    • Identifier (case-preserving).
    • Integer literal, text literal.
    • Punctuation: ,, ;, (, ), *.
    • Operators: =, !=, <, <=, >, >=.
  2. Implement tokenize as a single character-by-character pass over the source bytes:
    • Skip whitespace, tracking the line via \n and the column as the 1-based byte offset since the last \n.
    • On --: skip to end of line.
    • On [A-Za-z_]: read an identifier; uppercase-fold it and look it up in the keyword table. If found, emit the keyword token; otherwise emit an identifier token with the verbatim bytes (no case folding).
    • On [0-9]: read an unsigned integer literal. A leading - is not lexed as part of the literal; the parser applies it in value position (step 02).
    • On ': read a string literal; '' is a single embedded quote; missing close quote is an error reporting the opening (line, col).
    • On <, >, !: peek for = to form <=, >=, !=.
    • On =, ,, ;, (, ), *: emit a single-char token.
    • Anything else: error reporting (line, col) of the bad byte.
  3. Every emitted Token carries its (line, col) (the start of the token), so parser errors can blame the right column even when the token is several characters long.
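
The task list above can be sketched as a single byte-wise pass. This is a minimal illustration, not the lab's reference tokenizer: the TokKind layout, the String error type, and the message texts are assumptions, and so is the choice to emit a lone - as a punctuation token so that the parser can apply the sign later (the task list does not say how - reaches the parser).

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum TokKind {
    Keyword(&'static str), // canonical uppercase spelling
    Ident(String),         // verbatim source bytes, case preserved
    Int(i64),
    Text(String),          // unescaped contents ('' already collapsed to ')
    Punct(char),           // , ; ( ) * and, as an assumption, a lone -
    Op(&'static str),      // = != < <= > >=
}

#[derive(Debug, Clone, PartialEq)]
pub struct Token {
    pub kind: TokKind,
    pub line: u32, // 1-based
    pub col: u32,  // 1-based byte column of the token's first byte
}

static KEYWORDS: [&str; 14] = [
    "SELECT", "FROM", "WHERE", "INSERT", "INTO", "VALUES", "CREATE",
    "TABLE", "DELETE", "UPDATE", "SET", "AND", "INT", "TEXT",
];

fn err(line: u32, col: u32, msg: &str) -> String {
    format!("parse error at line {line} col {col}: {msg}")
}

pub fn tokenize(src: &str) -> Result<Vec<Token>, String> {
    let b = src.as_bytes();
    let (mut i, mut line, mut col) = (0usize, 1u32, 1u32);
    let mut out = Vec::new();
    while i < b.len() {
        let (sl, sc, start) = (line, col, i); // where this token begins
        let kind = match b[i] {
            b'\n' => { i += 1; line += 1; col = 1; continue; }
            b' ' | b'\t' | b'\r' => { i += 1; col += 1; continue; }
            b'-' if b.get(i + 1) == Some(&b'-') => {
                // Line comment: skip to the newline, which resets the column.
                while i < b.len() && b[i] != b'\n' { i += 1; }
                continue;
            }
            b'A'..=b'Z' | b'a'..=b'z' | b'_' => {
                while i < b.len() && (b[i].is_ascii_alphanumeric() || b[i] == b'_') { i += 1; }
                let word = &src[start..i];
                let upper = word.to_ascii_uppercase();
                match KEYWORDS.iter().find(|k| **k == upper) {
                    Some(k) => TokKind::Keyword(*k),
                    None => TokKind::Ident(word.to_string()),
                }
            }
            b'0'..=b'9' => {
                while i < b.len() && b[i].is_ascii_digit() { i += 1; }
                let v = src[start..i].parse::<i64>()
                    .map_err(|_| err(sl, sc, "integer out of range"))?;
                TokKind::Int(v)
            }
            b'\'' => {
                i += 1;
                let mut text = Vec::new();
                loop {
                    match b.get(i) {
                        // Blame the opening quote, not the end of input.
                        None => return Err(err(sl, sc, "unterminated string")),
                        Some(b'\'') if b.get(i + 1) == Some(&b'\'') => { text.push(b'\''); i += 2; }
                        Some(b'\'') => { i += 1; break; }
                        Some(&ch) => { text.push(ch); i += 1; }
                    }
                }
                TokKind::Text(String::from_utf8(text).map_err(|_| err(sl, sc, "invalid UTF-8"))?)
            }
            c @ (b'<' | b'>' | b'!') => {
                let eq = b.get(i + 1) == Some(&b'=');
                i += if eq { 2 } else { 1 };
                match (c, eq) {
                    (b'<', true) => TokKind::Op("<="),
                    (b'<', false) => TokKind::Op("<"),
                    (b'>', true) => TokKind::Op(">="),
                    (b'>', false) => TokKind::Op(">"),
                    (_, true) => TokKind::Op("!="),
                    (_, false) => return Err(err(sl, sc, "unexpected character")),
                }
            }
            b'=' => { i += 1; TokKind::Op("=") }
            c @ (b',' | b';' | b'(' | b')' | b'*' | b'-') => { i += 1; TokKind::Punct(c as char) }
            _ => return Err(err(sl, sc, "unexpected character")),
        };
        // Advance (line, col) over the bytes just consumed, newlines included.
        for &ch in &b[start..i] {
            if ch == b'\n' { line += 1; col = 1; } else { col += 1; }
        }
        out.push(Token { kind, line: sl, col: sc });
    }
    Ok(out)
}
```

Capturing the start position before the match is what lets a multi-character token, or an unterminated string, report the column where it began.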

Acceptance

Inline unit tests (Rust names; mirror them in Go and C++):

  • tokenize_happy — a single mixed input exercising every token kind. Assert the resulting Vec<TokKind> matches the expected sequence.
  • tokenize_strings_and_errors — a '' escape lexes to the unescaped contents; an unterminated string returns parse error at line N col M: ... with the correct (N, M).

Both green in Rust, Go, and C++.

Discussion prompts

  • Why fold keywords but not identifiers? What would change in our fixture hashes if we case-folded identifiers like SQLite does?
  • The tokenizer recognizes 14 keywords. Which keyword would we add first if we wanted to parse LIMIT 10? Why does adding it require a parser change too?
  • We chose to track (line, col) per token rather than per character offset. What's the trade-off?

db-12 step 02 — Parser and AST

Goal

Implement parse(src) -> Result<Vec<Statement>, ParseError> that consumes the token stream from step 01 and produces a typed AST. Parser errors carry 1-based (line, col) from the offending token.

Tasks

  1. Define the AST:
    • ColType { Int, Text }.
    • Literal { Int(i64), Text(String) } with explicit tag bytes Int=1, Text=2 (matters for serialization).
    • Op { Eq=1, Ne=2, Lt=3, Le=4, Gt=5, Ge=6 } — #[repr(u8)] in Rust, OpEq=1..OpGe=6 constants in Go, enum class Op : uint8_t in C++.
    • Where { col: String, op: Op, lit: Literal } — present-or-absent via Option/pointer/has_where flag.
    • SelectCols { Star, Named(Vec<String>) }.
    • Statement enum with five variants: Create { name, cols: Vec<(name, ColType)> }, Insert { name, rows: Vec<Vec<Literal>> }, Select { name, cols: SelectCols, where_: Option<Where> }, Delete { name, where_: Option<Where> }, Update { name, sets: Vec<(name, Literal)>, where_: Option<Where> }.
  2. Implement Parser as { tokens: &[Token], pos: usize } with one method per non-terminal: parse_program, parse_stmt, parse_create, parse_insert, parse_select, parse_delete, parse_update, parse_where, parse_literal. Each method consumes tokens left-to-right with single-token lookahead via peek.
  3. On any unexpected token, produce parse error at line L col C: <message>. Make sure the <message> and (L, C) are stable across the three languages — cross_test.sh asserts this.
  4. Preserve insertion order everywhere. SELECT column lists, UPDATE SET assignments, INSERT row lists, CREATE column lists are all Vec/slice/std::vector (never HashMap / map).
  5. A leading - before an integer literal in value position (RHS of WHERE col OP -1, INSERT/UPDATE literals) parses as a negative integer literal. It is not a unary-minus operator; there is no expression grammar.
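
A minimal sketch of tasks 2, 3, and 5 for the WHERE clause only. The Op, Literal, and Where shapes follow task 1; the Tok and Token types are hypothetical stand-ins for the step 01 token stream (including a Minus token, which is an assumption), and the "expected literal" and "expected comparison operator" messages are placeholders rather than the lab's frozen wording.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
#[repr(u8)]
pub enum Op { Eq = 1, Ne = 2, Lt = 3, Le = 4, Gt = 5, Ge = 6 }

#[derive(Debug, Clone, PartialEq)]
pub enum Literal { Int(i64), Text(String) }

#[derive(Debug, Clone, PartialEq)]
pub struct Where { pub col: String, pub op: Op, pub lit: Literal }

// Hypothetical token type, reduced to what parse_where needs.
#[derive(Debug, Clone, PartialEq)]
pub enum Tok { Where, Ident(String), Int(i64), Text(String), Op(Op), Minus, Semi }

#[derive(Debug, Clone, PartialEq)]
pub struct Token { pub tok: Tok, pub line: u32, pub col: u32 }

pub struct Parser<'a> { tokens: &'a [Token], pos: usize }

impl<'a> Parser<'a> {
    pub fn new(tokens: &'a [Token]) -> Self { Parser { tokens, pos: 0 } }

    // Single-token lookahead: inspect the current token without consuming it.
    fn peek(&self) -> Option<&'a Tok> { self.tokens.get(self.pos).map(|t| &t.tok) }

    // Blame the offending token, or the last token when input ran out.
    fn fail<T>(&self, msg: &str) -> Result<T, String> {
        let at = self.tokens.get(self.pos).or(self.tokens.last());
        let (line, col) = at.map_or((1, 1), |t| (t.line, t.col));
        Err(format!("parse error at line {line} col {col}: {msg}"))
    }

    // A leading - in value position folds into the integer literal;
    // there is no unary-minus node in the AST.
    pub fn parse_literal(&mut self) -> Result<Literal, String> {
        let negative = self.peek() == Some(&Tok::Minus);
        if negative { self.pos += 1; }
        match self.peek() {
            Some(Tok::Int(v)) => {
                self.pos += 1;
                Ok(Literal::Int(if negative { -*v } else { *v }))
            }
            Some(Tok::Text(s)) if !negative => {
                self.pos += 1;
                Ok(Literal::Text(s.clone()))
            }
            _ => self.fail("expected literal"),
        }
    }

    // WHERE is optional: return Ok(None) without consuming anything if absent.
    pub fn parse_where(&mut self) -> Result<Option<Where>, String> {
        if self.peek() != Some(&Tok::Where) { return Ok(None); }
        self.pos += 1;
        let col = match self.peek() {
            Some(Tok::Ident(name)) => name.clone(),
            _ => return self.fail("expected identifier"),
        };
        self.pos += 1;
        let op = match self.peek() {
            Some(Tok::Op(op)) => *op,
            _ => return self.fail("expected comparison operator"),
        };
        self.pos += 1;
        let lit = self.parse_literal()?;
        Ok(Some(Where { col, op, lit }))
    }
}
```

Because Op is #[repr(u8)], a serializer can emit `op as u8` directly, which keeps the tag values in one place.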

Acceptance

Inline unit tests (Rust names; mirror in Go and C++):

  • parse_create_table — CREATE TABLE with one INT column and one TEXT column.
  • parse_insert_multirow — multi-row INSERT VALUES (..), (..), exercising both literal kinds.
  • parse_select_variants_and_all_ops — SELECT *, SELECT col, col, and each of the 6 comparison operators in WHERE.
  • parse_update_and_delete — UPDATE with multi-column SET and WHERE; DELETE with WHERE.
  • parse_multi_with_comments_and_case — -- line comments, keywords in mixed case, identifiers preserved verbatim.
  • parse_errors_report_column — SELECT FROM t; reports parse error at line 1 col 8: expected identifier.

All six green in Rust, Go, and C++.

Discussion prompts

  • Recursive descent works because our grammar is LL(1). What's the single most popular SQL construct that isn't LL(1) and how would we extend the parser to handle it?
  • We parse INSERT INTO t VALUES (1, 'a') but not INSERT INTO t (a, b) VALUES (1, 'x'). Which token's lookahead would tell us we're in the second form, and how would that change parse_insert?
  • Why does the negative-literal-in-value-position decision live in the parser rather than the tokenizer? Hint: what would WHERE a - b mean if it were a tokenizer rule?

db-12 step 03 — Serializer and cross-language byte identity

Goal

Define a deterministic binary format for the AST, implement serialize(stmts) -> Vec<u8> in all three languages, ship a sqlctl CLI that prints the bytes, and prove via sha256 that all three implementations agree on every legal input.

CLI contract

sqlctl parse --file <path>
sqlctl parse --inline "<sql>"
  • Stdout receives the raw bytes from serialize(parse(...)) — no framing, no trailing newline.
  • Stderr receives the lowercase hex sha256 of stdout — no trailing newline.
  • On parse error, write parse error at line L col C: <msg>\n to stderr and exit 1. Stdout must be empty.

Tasks

  1. Implement serialize per the wire format in CONCEPTS.md. Magic header b"DSESQL01" then u32 LE count then per-statement records with u8 kind tags. Numbers are unsigned little-endian unless noted; INT literals are i64 LE; strings are u32 LE length + raw UTF-8 bytes.
  2. Inline a SHA-256 implementation (Rust sha256 + sha256_hex; C++ sha256_hex). In Go, use crypto/sha256 for brevity (stdlib is allowed; the implementation is determined by the standard, so cross-language identity is preserved).
  3. Build sqlctl in Rust (src/rust/src/bin/sqlctl.rs), Go (src/go/cmd/sqlctl/main.go), and C++ (src/cpp/src/sqlctl.cc).
  4. Freeze the two fixtures scripts/fixtures/a_basic.sql and scripts/fixtures/b_full.sql, which together exercise every statement kind, both literal types, the '' escape, and every comparison operator. Compute their sha256 once from the Rust reference; freeze the values in:
    • scripts/cross_test.sh (as want_hash cases)
    • src/go/sql_test.go (TestFixtureAHash, TestFixtureBHash)
    • src/cpp/tests/test_sqlfront12.cc (test_fixture_a_hash, test_fixture_b_hash)
    • CONCEPTS.md (frozen-hash table)
  5. Write scripts/verify.sh — builds + unit-tests the three languages; prints === OK === on success.
  6. Write scripts/cross_test.sh:
    • Build the three sqlctl binaries.
    • For each fixture, run sqlctl parse --file FIX for all three; assert all three stderr hashes match each other and match the frozen value; assert the CLI hash equals shasum -a 256 of stdout; assert the bytes are bit-identical (cmp -s).
    • Inline-arg smoke test: sqlctl parse --inline 'SELECT * FROM t;' must match across languages.
    • Error-path smoke test: feed SELECT FROM t; to all three; each must exit non-zero with a stderr line that mentions the column.
    • Print === ALL OK === on success.
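
The encoding primitives named in task 1 can be sketched as follows. The magic header, the u32 LE count, the i64 LE integers, and the length-prefixed strings come from the task text; placing the literal tag byte (Int=1, Text=2) immediately before its payload is an assumption, since the authoritative record layout lives in CONCEPTS.md. The helper names are illustrative.

```rust
pub enum Literal { Int(i64), Text(String) }

// Unsigned 32-bit little-endian, used for counts and length prefixes.
fn put_u32(out: &mut Vec<u8>, v: u32) {
    out.extend_from_slice(&v.to_le_bytes());
}

// Strings are a u32 LE byte length followed by raw UTF-8, no terminator.
fn put_str(out: &mut Vec<u8>, s: &str) {
    put_u32(out, s.len() as u32);
    out.extend_from_slice(s.as_bytes());
}

// Assumed layout: one tag byte, then the payload for that literal kind.
pub fn put_literal(out: &mut Vec<u8>, lit: &Literal) {
    match lit {
        Literal::Int(v) => {
            out.push(1);
            out.extend_from_slice(&v.to_le_bytes()); // i64 LE, two's complement
        }
        Literal::Text(s) => {
            out.push(2);
            put_str(out, s);
        }
    }
}

// Every serialized program starts with the magic and the statement count.
pub fn header(statement_count: u32) -> Vec<u8> {
    let mut out = b"DSESQL01".to_vec();
    put_u32(&mut out, statement_count);
    out
}
```

Using to_le_bytes everywhere, rather than casting through native-endian memory, is what keeps the output independent of the host CPU.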

Acceptance

$ scripts/verify.sh
=== rust === ... ok
=== go   === ... ok
=== cpp  === ... ok
=== OK ===

$ scripts/cross_test.sh
=== build ===
=== fixture: a_basic.sql ===
  rust=071b40fd... (     181 B)
  go  =071b40fd... (     181 B)
  cpp =071b40fd... (     181 B)
  match: 071b40fd5d0c684695c5a8499be6fe970ed4533af16f71dcc4c455091b576d15
=== fixture: b_full.sql ===
  rust=e219f1ee... (     486 B)
  ...
  match: e219f1ee4ae69f194cca7b9791aa2e34ecdb2680956dbf8a94618fa8093aa962
=== inline-arg smoke test ===
  inline hash: 941f2125...
=== error-path smoke test ===
  [rust] parse error at line 1 col 8: expected identifier
  [go] parse error at line 1 col 8: expected identifier
  [cpp] parse error at line 1 col 8: expected identifier
=== ALL OK ===

Inline unit tests (mirror across three languages):

  • serialize_header_and_count — output starts with "DSESQL01" + the correct u32 LE statement count.
  • serialize_is_deterministic — serialize(ast) == serialize(ast) byte-for-byte on a non-trivial AST.
  • sha256_known_vectors — sha256("") and sha256("abc") match the FIPS 180-4 reference vectors.

Discussion prompts

  • Why is the cross-language sha256 match a near-proof of correctness rather than an actual proof? What kind of bug could match anyway?
  • The serialized b_full.sql fixture is 486 bytes. Why is that more interesting than a 100k-byte randomly generated SQL file with the same hash check?
  • If we wanted to add LIMIT N to the SELECT grammar tomorrow, what would the smallest backwards-compatible change to the wire format look like? Why does that question matter the first time we want to evolve the AST?

db-13 — Transactions and MVCC

What is it?

A multi-version concurrency control key-value store with snapshot isolation semantics, in pure memory, ported byte-identically across Rust, Go, and C++. There is no disk, no log, no recovery — only the core MVCC machinery: per-key version chains, a single timestamp oracle, optimistic write-set conflict detection at commit time, and a garbage collector that respects active snapshots.

Every key holds a Vec<Version> sorted ascending by commit_ts, where a Version is { commit_ts: u64, payload: Option<Bytes> } and a payload of None marks a committed tombstone. A transaction at start_ts reads the newest version with commit_ts <= start_ts, ignoring everything committed after it began. On commit, the transaction's write-set is checked against the chains: if any written key has a committed version with commit_ts > start_ts, the commit aborts with a write-write conflict; otherwise the transaction's writes are appended under a freshly issued commit_ts.
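
The snapshot read rule can be sketched in a few lines. This assumes Bytes is simply Vec<u8> and that the chain is already sorted ascending by commit_ts; the function name read_at is illustrative.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Version {
    pub commit_ts: u64,
    pub payload: Option<Vec<u8>>, // None marks a committed tombstone
}

/// Returns the value visible to a transaction that began at `start_ts`:
/// the newest version with commit_ts <= start_ts, or None when no such
/// version exists or that version is a tombstone.
pub fn read_at(chain: &[Version], start_ts: u64) -> Option<&[u8]> {
    chain
        .iter()
        .rev() // walk newest to oldest
        .find(|v| v.commit_ts <= start_ts)
        .and_then(|v| v.payload.as_deref())
}
```

Note that a tombstone hides older values rather than falling through to them: the search stops at the first visible version, whatever its payload.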

The lab's load-bearing artifact is a canonical byte serializer for the entire store and a deterministic multi-worker workload. The serialized bytes hash to the same SHA-256 in all three languages.

Why does it matter?

This is the lab where transactions become real. Every storage engine so far in the project has been single-writer or last-write-wins. The moment two transactions can race to update the same key, you need to decide what the database does about it, and that decision shapes everything from the API up to the failure model.

Snapshot isolation is the dominant choice in modern engines:

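  • Postgres provides SI at REPEATABLE READ and a stricter serializable variant (SSI) at SERIALIZABLE; its default level, READ COMMITTED, takes a fresh snapshot per statement.
  • TiKV is built on Percolator-style MVCC with snapshot reads and optimistic commit; CockroachDB and FoundationDB are likewise MVCC engines with timestamp-based snapshot reads, each with its own commit protocol.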
  • Microsoft Hekaton is a pure in-memory MVCC engine almost identical in shape to this lab.
  • RocksDB's Transaction layer implements optimistic and pessimistic MVCC on top of LSM versions.

What MVCC buys you is the property that readers never block writers and writers never block readers. The cost is space (multiple versions per key) and a garbage-collection problem (when can the old versions be dropped without breaking some live snapshot?). This lab confronts both.

It also forces the engineer to internalize a precise statement of what SI does not give you, the write-skew anomaly, which is a perennial database interview question because snapshot isolation is so often conflated with serializability.

How does it work?

                ┌──────────── timestamp oracle (atomic u64) ────────────┐
                │   begin() → start_ts;   commit() → commit_ts          │
                └───────────────────────────────────────────────────────┘
                          │                              │
              ┌───────────▼──────────┐         ┌─────────▼─────────┐
              │ Txn { start_ts,      │         │ Store {           │
              │       writes: BTree, │  put    │   chains: BTree<  │
              │       closed: bool } │ ────────►│     key → Vec<   │
              │                      │  del    │       Version>>, │
              │ get(k):              │         │   active_starts, │
              │   1. local writes    │  get    │   oracle          │
              │   2. chain[k] newest │ ◄────── │ }                 │
              │      with commit_ts  │         │                   │
              │      ≤ start_ts      │         │                   │
              │                      │ commit  │   conflict-check, │
              │ commit():            │ ───────►│   then append at  │
              │   conflict-check     │         │   commit_ts       │
              │   then publish       │         │                   │
              └──────────────────────┘         └───────────────────┘
                                                        │
                                                        │   gc(below_ts)
                                                        ▼
                                          drop v[i] iff exists v[i+1]
                                          with commit_ts ≤
                                          min(below_ts, oldest_active)

The five operations

  1. begin() — atomically increments the oracle, calls the resulting number start_ts, registers it in the active starts multiset.
  2. get(k) — first checks the txn's local write-set (read-your-own-writes), then walks the chain for k from newest to oldest looking for the first version with commit_ts <= start_ts. Returns None if that version is a tombstone or no such version exists.
  3. put(k, v) / delete(k) — buffer into a per-txn BTreeMap. No store I/O.
  4. commit() — under the store mutex:
    1. for each key in the write-set, fail with Conflict { key, conflicting_ts } if the chain's newest version has commit_ts > start_ts;
    2. otherwise allocate commit_ts from the oracle;
    3. append each local write to the chain under commit_ts;
    4. remove start_ts from the active set.

    A read-only commit (writes.is_empty()) skips steps 1–3 and just retires from the active set.
  5. abort() — discards the write-set and retires from the active set. Idempotent. Drop/destructor calls abort() if neither commit() nor abort() ran.
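
The begin/commit path above can be sketched single-threaded. This is an illustration under stated assumptions: the store mutex is omitted (callers hold &mut Store instead), the active set is modelled as a BTreeMap of refcounts, Bytes is Vec<u8>, and a conflicting commit is assumed to retire the transaction from the active set just as abort() would.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
pub struct Version { pub commit_ts: u64, pub payload: Option<Vec<u8>> }

#[derive(Debug, PartialEq)]
pub struct Conflict { pub key: Vec<u8>, pub conflicting_ts: u64 }

pub struct Txn {
    pub start_ts: u64,
    pub writes: BTreeMap<Vec<u8>, Option<Vec<u8>>>, // None = buffered delete
}

#[derive(Default)]
pub struct Store {
    pub chains: BTreeMap<Vec<u8>, Vec<Version>>,
    pub oracle: u64,                  // last timestamp issued
    pub active: BTreeMap<u64, usize>, // start_ts -> refcount
}

impl Store {
    pub fn begin(&mut self) -> Txn {
        self.oracle += 1;
        *self.active.entry(self.oracle).or_insert(0) += 1;
        Txn { start_ts: self.oracle, writes: BTreeMap::new() }
    }

    fn retire(&mut self, start_ts: u64) {
        let gone = match self.active.get_mut(&start_ts) {
            Some(n) => { *n -= 1; *n == 0 }
            None => false,
        };
        if gone { self.active.remove(&start_ts); }
    }

    /// Ok(Some(commit_ts)) for a writing commit, Ok(None) for a read-only one.
    pub fn commit(&mut self, txn: Txn) -> Result<Option<u64>, Conflict> {
        if txn.writes.is_empty() {
            self.retire(txn.start_ts);
            return Ok(None);
        }
        // Step 1: write-write check, in sorted key order so the reported
        // key is deterministic.
        for key in txn.writes.keys() {
            if let Some(newest) = self.chains.get(key).and_then(|c| c.last()) {
                if newest.commit_ts > txn.start_ts {
                    let conflict = Conflict { key: key.clone(), conflicting_ts: newest.commit_ts };
                    self.retire(txn.start_ts); // assumption: conflict also retires
                    return Err(conflict);
                }
            }
        }
        // Steps 2 and 3: allocate commit_ts, then publish every write under it.
        self.oracle += 1;
        let commit_ts = self.oracle;
        for (key, payload) in txn.writes {
            self.chains.entry(key).or_default().push(Version { commit_ts, payload });
        }
        // Step 4: leave the active set.
        self.retire(txn.start_ts);
        Ok(Some(commit_ts))
    }
}
```

Allocating commit_ts only after the conflict check passes means an aborted transaction never consumes a timestamp, which matters for cross-language determinism of the dump.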

The active set and GC

The store keeps a refcount-multiset of currently-active start_ts values. gc(below_ts) computes

cutoff = min(below_ts, oldest_active_start_ts)

and for every chain, drops every prefix version v[i] such that v[i+1] exists with v[i+1].commit_ts <= cutoff. The newest version is always retained — future readers may still need it (or its tombstone).

The reasoning: any reader at start_ts >= cutoff picks the newest version with commit_ts <= start_ts, and that is never a v[i] from the dropped prefix, because v[i+1] is also visible to that reader and is newer. Readers with start_ts < cutoff cannot exist: every active transaction has start_ts >= oldest_active >= cutoff, and every later begin() receives a larger timestamp still.
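
The pruning rule for one chain can be sketched as below. The function name gc_chain is illustrative, and treating an empty active set as cutoff = below_ts is an assumption, since the text does not spell out that case.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Version { pub commit_ts: u64, pub payload: Option<Vec<u8>> }

/// Prunes one version chain (sorted ascending by commit_ts) and returns
/// how many versions were dropped.
pub fn gc_chain(chain: &mut Vec<Version>, below_ts: u64, oldest_active: Option<u64>) -> usize {
    // Assumption: with no active transactions the cutoff is just below_ts.
    let cutoff = oldest_active.map_or(below_ts, |a| a.min(below_ts));
    // v[i] is droppable iff v[i+1] exists with commit_ts <= cutoff. On a
    // sorted chain that is every index strictly before the newest version
    // whose commit_ts <= cutoff.
    let keep_from = chain
        .iter()
        .rposition(|v| v.commit_ts <= cutoff)
        .unwrap_or(0); // nothing at or below the cutoff: keep everything
    chain.drain(..keep_from).count()
}
```

The newest version always survives because keep_from is at most the last index, so drain never empties a non-empty chain.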

This is the same reasoning Postgres VACUUM uses with xmin/xmax and OldestXmin, the same reasoning TiKV uses with its "safe point", and the same reasoning Hekaton's GC uses with its "oldest active transaction".

Snapshot isolation, not serializable

The commit-time check looks at the write-set only. It does not look at the read-set. This means:

  • Two txns reading the same key and updating the same key → exactly one wins. (Lost-update is prevented.)
  • Two txns reading the same key and updating different keys based on their reads → both can succeed. (Write skew is allowed.)

The classic write-skew anomaly:

T1: r(x); r(y); if x+y >= 0: w(x, -100)
T2: r(x); r(y); if x+y >= 0: w(y, -100)

Starting from x=0, y=0, both txns observe x+y=0, both write, and both commit (different keys, so no write-set overlap). The post-state is x=-100, y=-100, which no serial schedule (T1 then T2, or T2 then T1) can produce: whichever transaction runs second sees x+y=-100 and skips its write. Snapshot isolation allows this. Serializable SI (Postgres SSI) catches it via dangerous-structure detection on read dependencies. We deliberately do not implement that; db-13 is the smallest faithful SI engine.
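
The anomaly can be reproduced with a toy model that has no timestamps at all. This is a deliberately simplified sketch, not the lab's store: a plain map stands in for the database, "concurrent" means both transactions read the same initial snapshot, and the write-write check is reduced to comparing the two written keys.

```rust
use std::collections::BTreeMap;

type Db = BTreeMap<&'static str, i64>;

/// The guarded write from the example: if x + y >= 0, set `target` to -100.
fn guarded_write(snapshot: &Db, target: &'static str) -> Option<(&'static str, i64)> {
    if snapshot["x"] + snapshot["y"] >= 0 { Some((target, -100)) } else { None }
}

/// T1 and T2 both read the initial snapshot, as they would under SI when
/// they overlap in time. A commit is rejected only on a write-set overlap.
pub fn run_concurrent_si(initial: &Db) -> Db {
    let w1 = guarded_write(initial, "x"); // T1
    let w2 = guarded_write(initial, "y"); // T2
    let mut db = initial.clone();
    if let Some((k, v)) = w1 { db.insert(k, v); }
    if let Some((k2, v2)) = w2 {
        let overlaps = w1.map_or(false, |(k1, _)| k1 == k2);
        if !overlaps { db.insert(k2, v2); } // disjoint keys: T2 also commits
    }
    db
}

/// A serial schedule: each transaction sees the previous one's writes.
pub fn run_serial(initial: &Db, order: [&'static str; 2]) -> Db {
    let mut db = initial.clone();
    for target in order {
        if let Some((k, v)) = guarded_write(&db, target) { db.insert(k, v); }
    }
    db
}
```

Running both serial orders from x=0, y=0 leaves exactly one key at -100, while the concurrent run leaves both there, which is the gap between SI and serializability.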

Cross-language invariant

mvccctl workload --seed S --ops N --keys K --writers W --readers R --scenario {writeheavy|mixed|conflicting} is the cross-language contract:

  • Identical SplitMix64 PRNG seeded with S.
  • Each op draws three samples: worker_idx = r1 % (W+R), key_idx = r2 % K, payload = (u32)r3 big-endian.
  • Workers 0..W are writers (they put then commit every 4 ops); workers W..W+R are readers (they get then commit every 4 ops).
  • Open transactions are drained at the end.
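
A sketch of the generator behind that contract. The SplitMix64 step uses Vigna's published constants; the assumptions are that r1, r2, and r3 are three consecutive outputs drawn in that order, that (u32)r3 keeps the low 32 bits, and the WorkloadOp struct itself, which is illustrative.

```rust
pub struct SplitMix64 { state: u64 }

impl SplitMix64 {
    pub fn new(seed: u64) -> Self { SplitMix64 { state: seed } }

    // One SplitMix64 step; all arithmetic wraps modulo 2^64.
    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

#[derive(Debug, PartialEq)]
pub struct WorkloadOp {
    pub worker_idx: u64,  // 0..W are writers, W..W+R are readers
    pub key_idx: u64,
    pub payload: [u8; 4], // low 32 bits of r3, big-endian
}

/// Draws the three samples for one op. Panics if writers + readers == 0
/// or keys == 0, since the modulus would be zero.
pub fn next_op(rng: &mut SplitMix64, writers: u64, readers: u64, keys: u64) -> WorkloadOp {
    let r1 = rng.next_u64();
    let r2 = rng.next_u64();
    let r3 = rng.next_u64();
    WorkloadOp {
        worker_idx: r1 % (writers + readers),
        key_idx: r2 % keys,
        payload: (r3 as u32).to_be_bytes(),
    }
}
```

Binding r1, r2, r3 in separate statements fixes the draw order explicitly; in C++ the evaluation order of function arguments is unspecified, so the same discipline is worth mirroring there.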

The store is then serialized via the canonical dump and SHA-256-hex'd to stdout (no trailing newline).

Wire format

"DSEMVCC1"          (8 ASCII bytes)
u64 LE next_ts                  ← oracle + 1
u32 LE key_count
per key (sorted ascending by raw key bytes):
  u32 LE klen
  key bytes
  u32 LE version_count
  per version (ascending by commit_ts):
    u64 LE commit_ts
    u8  has_value                 0 = tombstone, 1 = value
    if has_value:
      u32 LE vlen
      vbytes

All integers are unsigned little-endian. Keys and values are length-prefixed; no null terminators, no escapes. next_ts is oracle + 1 to match the next value begin() would issue, which makes the dump round-trippable: a future MvccStore::load can resume the oracle exactly.
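
The layout above translates almost line for line into a dump routine. A minimal sketch, assuming Bytes is Vec<u8>, that `oracle` is the last timestamp issued, and that the chains live in a BTreeMap so that iteration is already sorted by raw key bytes; the function name dump is illustrative.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
pub struct Version { pub commit_ts: u64, pub payload: Option<Vec<u8>> }

pub fn dump(chains: &BTreeMap<Vec<u8>, Vec<Version>>, oracle: u64) -> Vec<u8> {
    let mut out = b"DSEMVCC1".to_vec();
    out.extend_from_slice(&(oracle + 1).to_le_bytes()); // next_ts
    out.extend_from_slice(&(chains.len() as u32).to_le_bytes());
    // BTreeMap<Vec<u8>, _> iterates in lexicographic byte order.
    for (key, versions) in chains {
        out.extend_from_slice(&(key.len() as u32).to_le_bytes());
        out.extend_from_slice(key);
        out.extend_from_slice(&(versions.len() as u32).to_le_bytes());
        for v in versions {
            out.extend_from_slice(&v.commit_ts.to_le_bytes());
            match &v.payload {
                Some(bytes) => {
                    out.push(1); // has_value
                    out.extend_from_slice(&(bytes.len() as u32).to_le_bytes());
                    out.extend_from_slice(bytes);
                }
                None => out.push(0), // tombstone: no vlen, no vbytes
            }
        }
    }
    out
}
```

A one-key store with a value at commit_ts 2 and a tombstone at commit_ts 3 dumps to 52 bytes, which makes a convenient hand-checkable unit test before comparing hashes across languages.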

Why these particular determinism guarantees

  • Key iteration order — std::map<Bytes,...> (C++), sorted slice (Go), BTreeMap (Rust). Never raw map iteration in any port.
  • Within-key version order — natural append order (we always append at the newest commit_ts), reinforced by the chain being a Vec, not a set.
  • Per-txn write-set order at commit — BTreeMap / sorted keys. This is not visible in the dump itself (writes from a single commit share commit_ts), but it determines which key a multi-key conflict reports, which matters for the error tests.
  • Workload PRNG — single-threaded SplitMix64 stream with the exact constants Sebastiano Vigna published. No rand crate, no math/rand, no <random> — those are NOT cross-implementation stable.

Frozen reference hashes

Scenario  --seed --ops --keys --writers --readers --scenario                           sha256
A         --seed 42 --ops 500 --keys 16 --writers 4 --readers 4 --scenario mixed       67d65acae63d8612114131a679c02912b7f8f63df10bce30a2b0def810b7c547
B         --seed 7 --ops 2000 --keys 4 --writers 8 --readers 2 --scenario conflicting  11433ba130a81a092743c08791f9790c4f148607eef1e23c163a20e354c03824

Any change to the MVCC semantics, the workload generator, the wire format, or any defaulting in the CLI must update those numbers in scripts/cross_test.sh, the Go test (mvcc_test.go), the C++ test (tests/test_mvcc13.cc), and this table — all in the same commit.

What's intentionally out of scope

  • Durability. No WAL, no fsync, no crash recovery. The whole store vanishes on process exit. Adding a WAL on top is db-21 work.
  • Serializability. Snapshot isolation only; we deliberately allow write skew. SSI is a follow-on lab.
  • Read-set tracking. A txn does not remember which keys it read. Without that, SSI cannot detect anti-dependency cycles.
  • Locks. The store uses a single coarse mutex for clarity. A real in-memory MVCC engine (Hekaton, MemSQL) uses lock-free version installation with CAS on the chain head; we leave that to db-21.
  • Distributed timestamps. The oracle is a single atomic counter, not an HLC or TrueTime. Spanner / CRDB / TiKV-style distribution is db-16+ territory.
  • Range scans, secondary indexes, predicates. Single-key get / put / delete only. db-14 layers indexes on top.

db-13 — References

Foundational textbooks

  • Bernstein, Hadzilacos, Goodman — Concurrency Control and Recovery in Database Systems (Addison-Wesley, 1987). The canonical treatment of serialization theory: conflict-serializability, view-serializability, locking protocols, multi-version graphs. Chapter 5 ("Multiversion Concurrency Control") is the textbook derivation of the version-chain abstraction used in this lab. The whole book is freely available as a scanned PDF; the proofs of MVSR vs CSR equivalence are required reading for anyone who wants to know why SI is a thing.

  • Weikum & Vossen — Transactional Information Systems (Morgan Kaufmann, 2002). Modernizes the Bernstein treatment with page-model vs object-model schedules and chapter-length coverage of optimistic CC, snapshot isolation, and recovery. The treatment of the generalized SI anomaly catalog is the cleanest in print.

Snapshot isolation: definitional papers

  • Berenson, Bernstein, Gray, Melton, O'Neil, O'Neil — "A Critique of ANSI SQL Isolation Levels" (SIGMOD 1995). The paper that defines snapshot isolation precisely, names the anomalies (lost-update, dirty-read, fuzzy-read, phantom, A5A read-skew, A5B write-skew), and shows the ANSI standard's English-prose definitions are inadequate. Every claim in our CONCEPTS.md about what SI does and does not give you traces directly to this paper.

  • Fekete, Liarokapis, O'Neil, O'Neil, Shasha — "Making Snapshot Isolation Serializable" (TODS 2005). The dangerous-structure theorem that underpins Postgres's SSI. Required reading if you want to understand what the next lab over from this one would add.

Production MVCC engines

  • PostgreSQL 16 documentation, chapter 13 ("Concurrency Control"). https://www.postgresql.org/docs/16/mvcc.html. Postgres's xmin/xmax hidden columns are exactly our commit_ts / tombstone scheme, just with the tombstone collapsed into the next row's xmin. Chapter 13.6 ("Caveats") names write-skew explicitly.

  • PostgreSQL src/backend/access/heap/heapam.c and src/backend/utils/time/snapmgr.c. The C implementation of HeapTupleSatisfiesMVCC, GetOldestXmin, and VACUUM's visibility logic. Our gc(below_ts) is a faithful (single-tenant, single-shard) port of OldestXmin-based pruning.

  • Peng & Dabek — "Large-scale Incremental Processing Using Distributed Transactions and Notifications" (OSDI 2010). The Google Percolator paper. Defines the two-timestamp (start_ts, commit_ts) protocol on top of Bigtable that became the template for TiKV, CockroachDB's earliest design, and YugabyteDB. Our single-node oracle is the trivial special case of the Percolator TSO.

  • Diaconu, Freedman, Ismert, Larson, Mittal, Stonecipher, Verma, Zwilling — "Hekaton: SQL Server's Memory-Optimized OLTP Engine" (SIGMOD 2013). The deepest publicly available description of an in-memory MVCC engine. Section 3 ("Concurrency Control") describes their lock-free version installation, their GC ("oldest active transaction" again), and their decision to ship both optimistic and pessimistic SI variants. Our store is Hekaton with the locks added back and the latches removed.

  • Wu, Arulraj, Lin, Xian, Pavlo — "An Empirical Evaluation of In-Memory Multi-Version Concurrency Control" (VLDB 2017). The paper that catalogues, benchmarks, and ranks the MVCC design decisions (storage layout, version-chain ordering, GC strategy, index pointer to head vs tail). It is the single most useful paper for anyone designing an MVCC engine from scratch.

  • Kemper & Neumann — "HyPer: A Hybrid OLTP&OLAP Main Memory Database System Based on Virtual Memory Snapshots" (ICDE 2011). HyPer uses fork() for snapshots instead of version chains — a fascinating alternative that this lab does not implement but every engineer should know exists.

SI in distributed systems

  • Sovran, Power, Aguilera, Li — "Transactional Storage for Geo-Replicated Systems" (SOSP 2011) — the Walter paper. Defines parallel snapshot isolation (PSI), a weaker form of SI tractable across data centers. Useful framing if you ever wonder why CRDB doesn't just run plain SI.

  • Bailis, Davidson, Fekete, Ghodsi, Hellerstein, Stoica — "Highly Available Transactions: Virtues and Limitations" (VLDB 2014). Maps the entire CAP / isolation landscape onto availability. SI is provably unachievable in a system that must stay available during network partitions; this paper tells you exactly where the line is.

Lecture material worth the read

  • CMU 15-721 ("Advanced Database Systems") lectures by Andy Pavlo, Spring 2023. Lecture 04 "Multi-Version Concurrency Control" walks through Postgres / Hekaton / HyPer / Oracle in one hour. Slides + recording are on the CMU course page.

  • Joe Hellerstein's Berkeley CS 186 notes, "Concurrency Control II". Undergraduate-level but the diagrams of conflict graphs and the worked write-skew example are the clearest I have seen.

Lab cross-references

  • db-09 (LevelDB Complete) — the storage engine these transactions could one day be layered on top of. The LSM's sequence numbers are essentially commit_ts in disguise.
  • db-12 (SQL Frontend) — produces the AST that this engine would execute. The natural db-13.5 lab would wire them together.
  • db-14 (Indexes and Query Optimization) — adds secondary indexes; under MVCC, indexes need their own version chains or a pointer-to-head + tuple-side timestamp scheme. See Wu et al. §4.
  • db-16+ (Distributed Fundamentals, Raft, Paxos) — replace the single-node oracle with a distributed timestamp service. The semantic model carries over unchanged.

Indexes and Query Optimization

1. What Is It

A secondary index is an auxiliary data structure that maps each value of a non-primary-key column to the set of row-ids that contain it. A query planner turns a logical query (predicates + projections) into a physical plan tree (scan → filters → project) and picks an access path per predicate. A rule-based planner uses fixed heuristics; a cost-based planner consults statistics. db-14 implements the rule-based half end-to-end and keeps the cost model deliberately tiny (rows / distinct_keys for =, (rows+2)/3 for ranges) so the byte-for-byte cross-language invariant is tractable.

2. Why It Matters

A SeqScan on N rows costs Θ(N) regardless of selectivity. A point lookup through a sorted (or hashed) index is O(log N) (or O(1) expected). When predicates have selectivity ≪ 1, the normal case in OLTP, choosing the right access path is the single largest performance lever a database has. And once two physical operators are available, you need a planner to choose between them. A planner with the wrong cost model can be catastrophically slow on real workloads (see Leis et al., "How Good Are Query Optimizers, Really?", PVLDB 2015), but the planner is also the component through which every later optimisation (joins, partitioning, parallelism) gets expressed.

3. How It Works

            Query{projections, predicates}
                          │
                          ▼
                  ┌───────────────┐
                  │   Planner     │   rule-based:
                  │ estimate per  │   • Eq  → rows/distinct
                  │ indexable pred│   • Lt/Le/Gt/Ge → (rows+2)/3
                  │ pick min      │   • Ne / no-index → SeqScan
                  └──────┬────────┘
                         ▼
            Plan{ Pipeline[ scan, *Filter, Project? ] }
                         │
                         ▼
                  ┌───────────────┐
                  │   Executor    │   Volcano-style: scan rows,
                  │  scan→filter  │   retain on predicate,
                  │   →project    │   rewrite columns at end.
                  └──────┬────────┘
                         ▼
                     []Row

Indexes are BTreeMap<Value, Vec<row_id>> in Rust, std::map in C++, and a sorted slice of (Value, []row_id) in Go. Insertion order is preserved inside each bucket, which (combined with ascending key traversal) gives a total, deterministic output order shared across all three implementations.

4. Core Terminology

Term | Definition
Secondary index | Sorted map from column value → list of row-ids.
Access path | Concrete way to read rows for a predicate (SeqScan vs IndexScan).
Selectivity | Fraction of rows a predicate keeps; 0 ≤ s ≤ 1.
Cardinality estimate | Predicted row count out of an operator.
Pipeline | Linear chain of operators evaluated row-at-a-time (Volcano model).
Rule-based optimizer | Picks plans from fixed heuristics; no statistics.
Cost-based optimizer | Searches plan space; uses statistics + a cost function.
Covering index | Index that includes every column the query needs (skip the row lookup).
Index-only scan | Read the index without touching the heap.
Tuple | A single row of a relation.

5. Mental Models

  • Index = sorted dictionary. Equality is dictionary lookup; range is dictionary range(). Everything else is a special case.
  • Planner = predicate auctioneer. Each indexable predicate "bids" its estimated row count; the lowest bid wins the scan, the rest become Filters.
  • Executor = pipeline. Each operator pulls rows from its child. Operators don't materialise unless they must (project, sort, hash).
  • Wire format = correctness oracle. If three languages serialise the same plan and result bytes for the same query, they agree on planning + execution semantics. The SHA-256 collapses N MB of bytes into a 64-char string we can put in a case statement.

6. Common Misconceptions

  • "Indexes always speed up reads." False for unselective predicates (selectivity near 1): a SeqScan reads the heap once; an IndexScan dereferences each row-id, which may be worse if most rows match.
  • "More indexes is free." Every write must update every index — and indexes cost RAM/disk.
  • "Rule-based is obsolete." Modern systems (SQLite, MySQL) ship rule-based fallbacks for simple queries because cost-based planning has its own pathologies (bad stats → catastrophic plans).
  • "Selectivity = 1 / distinct keys." Only under uniform-distribution assumption. Skewed data needs histograms or sampling.
  • "Project is free." Wide projection through long pipelines materialises copies; columnar engines avoid this; row engines pay for it.
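To make the uniform-distribution caveat concrete, here is a minimal sketch that compares the rows / distinct_keys estimate with the true match count on a skewed column. The function names and the 97/1/1/1 data set are illustrative assumptions, not part of the lab.

```rust
use std::collections::BTreeMap;

/// Uniform-distribution estimate for `col = v`: rows / distinct_keys.
/// Returns 0 for an empty column (an assumption of this sketch).
fn uniform_eq_estimate(values: &[i64]) -> usize {
    let mut counts: BTreeMap<i64, usize> = BTreeMap::new();
    for v in values {
        *counts.entry(*v).or_insert(0) += 1;
    }
    if counts.is_empty() {
        0
    } else {
        values.len() / counts.len()
    }
}

/// Exact number of rows matching `col = target`.
fn actual_eq_count(values: &[i64], target: i64) -> usize {
    values.iter().filter(|v| **v == target).count()
}

/// Hypothetical skewed column: 97 rows hold 0, then one row each of 1, 2, 3.
fn skewed_column() -> Vec<i64> {
    let mut col = vec![0i64; 97];
    col.extend([1, 2, 3]);
    col
}
```

On this column the estimate is 100 / 4 = 25 rows for every key, while the true counts are 97 for key 0 and 1 for each of the others. A histogram or a most-common-values list is what closes that gap.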

7. Interview Talking Points

  • Explain how a B-tree index supports both point and range queries, and what changes for a hash index (point only, no ordering).
  • Walk through plan selection: predicates → estimates → cheapest scan → remaining as filters → optional project.
  • Why is index iteration order important? Determinism, merge-join inputs, ORDER BY elimination.
  • Explain Volcano vs vectorised execution. Tradeoffs?
  • What is a covering index, and when does it dominate?
  • Discuss EXPLAIN output: how do you read a query plan?
  • What can go wrong when the planner picks a SeqScan instead of an IndexScan (or vice versa)? Stale statistics, correlated predicates, type coercions that disable the index.

8. Connections to Other Labs

  • db-02 (data structures) introduced sorted maps; this lab puts them to work.
  • db-10 (B-tree) is the persistent counterpart of the in-memory index here.
  • db-12 (SQL frontend) produces the Query struct planners consume.
  • db-13 (transactions/MVCC) governs which rows the index sees per snapshot.
  • db-15 (SQLite-complete) stitches all of the above into a real engine.
  • db-22 (perf/benchmarking) measures the planner choices we make here.

9. Frozen Wire Format

Plan stream      = 0x05 (PIPELINE) | u32 LE child_count | child*
Child node tags:
   0x01 SeqScan    | u32 LE table_id(=0)
   0x02 IndexScan  | u32 LE col_idx | u8 op | u8 val_tag | <val>
   0x03 Filter     | u32 LE col_idx | u8 op | u8 val_tag | <val>
   0x04 Project    | u32 LE col_count | (u32 LE col_idx)*

Op codes : Eq=1 Ne=2 Lt=3 Le=4 Gt=5 Ge=6
Val tags : Int=1 (i64 LE) ; Text=2 (u32 LE len | bytes)

Result stream    = "DSEQR01" (7 bytes) | u32 LE row_count |
                   per row: u32 LE col_count | (u8 tag | <val>)*

Both streams are concatenated per op; the SHA-256 of that concatenation, across N ops, is the byte-identity oracle for the cross-language test.
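The plan half of the format can be sketched in Rust as follows. The tags, op codes, and field widths come from the layout above; the type and function names (Node, put_value, put_pred) are assumptions made for this illustration and need not match the lab's source.

```rust
/// Op codes as frozen in the wire format.
#[derive(Clone, Copy)]
#[repr(u8)]
enum Op { Eq = 1, Ne = 2, Lt = 3, Le = 4, Gt = 5, Ge = 6 }

enum Value { Int(i64), Text(Vec<u8>) }

/// Hypothetical plan-node type for this sketch.
enum Node {
    SeqScan,
    IndexScan { col: u32, op: Op, val: Value },
    Filter { col: u32, op: Op, val: Value },
    Project { cols: Vec<u32> },
}

/// u8 val_tag followed by the value: Int = i64 LE, Text = u32 LE len | bytes.
fn put_value(out: &mut Vec<u8>, v: &Value) {
    match v {
        Value::Int(i) => {
            out.push(1);
            out.extend_from_slice(&i.to_le_bytes());
        }
        Value::Text(b) => {
            out.push(2);
            out.extend_from_slice(&(b.len() as u32).to_le_bytes());
            out.extend_from_slice(b);
        }
    }
}

/// Shared body of IndexScan (0x02) and Filter (0x03).
fn put_pred(out: &mut Vec<u8>, tag: u8, col: u32, op: Op, val: &Value) {
    out.push(tag);
    out.extend_from_slice(&col.to_le_bytes());
    out.push(op as u8);
    put_value(out, val);
}

/// Serialise one Pipeline: 0x05 | u32 LE child_count | children.
fn dump_plan(children: &[Node]) -> Vec<u8> {
    let mut out = vec![0x05u8];
    out.extend_from_slice(&(children.len() as u32).to_le_bytes());
    for n in children {
        match n {
            Node::SeqScan => {
                out.push(0x01);
                out.extend_from_slice(&0u32.to_le_bytes()); // table_id is always 0
            }
            Node::IndexScan { col, op, val } => put_pred(&mut out, 0x02, *col, *op, val),
            Node::Filter { col, op, val } => put_pred(&mut out, 0x03, *col, *op, val),
            Node::Project { cols } => {
                out.push(0x04);
                out.extend_from_slice(&(cols.len() as u32).to_le_bytes());
                for c in cols {
                    out.extend_from_slice(&c.to_le_bytes());
                }
            }
        }
    }
    out
}
```

A single-child plan such as Pipeline[IndexScan col=2 Eq Int(v)] serialises to 20 bytes under this layout: 5 for the Pipeline header and 15 for the IndexScan node. Every integer goes through to_le_bytes, so host endianness never reaches the output.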

References

Papers

  • Selinger, P. G., Astrahan, M. M., Chamberlin, D. D., Lorie, R. A., & Price, T. G. (1979). Access Path Selection in a Relational Database Management System. SIGMOD '79. The System R paper. Defines cost-based optimisation, dynamic programming over join orders, selectivity estimation. Every modern planner is a variation on this design.

  • Graefe, G. (1994). Volcano — An Extensible and Parallel Query Evaluation System. IEEE TKDE 6(1). The iterator-based "open/next/close" execution model used here. db-14's Executor is a flattened Volcano.

  • Graefe, G. (1995). The Cascades Framework for Query Optimization. Data Eng. Bulletin 18(3). Rule-based + cost-based search via memoisation on plan equivalence classes. Cascades is what SQL Server and CockroachDB derive from.

  • Graefe, G. (2011). Modern B-Tree Techniques. Foundations and Trends in Databases. The reference on B-trees — covers concurrent access, range scans, prefix compression, all relevant to "what an index is".

  • Leis, V., Gubichev, A., Mirchev, A., Boncz, P., Kemper, A., & Neumann, T. (2015). How Good Are Query Optimizers, Really? PVLDB 9(3). Empirical study showing that cardinality estimation errors dwarf cost-model errors; motivates why even very simple planners can be competitive.

Books

  • Bailis, P., Hellerstein, J. M., & Stonebraker, M. (eds, 2015). Readings in Database Systems (the "Red Book"), 5th edition. Chapters on query processing and optimisation. Free online.

  • Garcia-Molina, H., Ullman, J. D., & Widom, J. (2008). Database Systems: The Complete Book, 2nd ed. Chapters 15–16 (query processing) and 17 (optimisation).

  • Ramakrishnan, R. & Gehrke, J. (2003). Database Management Systems, 3rd ed. Chapters 12–15 cover indexing and external sorting.

Production system docs

Source code

  • SQLite where.c — single-file implementation of the planner. ~10k LoC of cost-based reasoning over WHERE clauses. The reference.
  • LevelDB db/version_set.cc — for a non-SQL planner-style scoring function on file-picking in compaction.
  • CockroachDB pkg/sql/opt/ — Cascades-style optimiser in Go.

Analysis

Goal

Build the smallest end-to-end query engine that nonetheless exercises the three concepts a real planner must get right:

  1. Access-path selection — choose between SeqScan and IndexScan.
  2. Predicate ordering — apply the most selective predicate first.
  3. Projection placement — only carry the columns the query asked for.

All three must be deterministic across Rust, Go, and C++, because the artifact under test is the SHA-256 of the serialised plan + result bytes.

Scope

Pure in-memory. No SQL parser (queries are constructed structurally). No persistence. No transactions. One table, fixed three-column schema (id INT, name TEXT, age INT). Three scenarios — scanonly, mixed (index on age), indexheavy (indexes on age and name).

Design Decisions

Index physical form

A BTreeMap<Value, Vec<row_id>> per indexed column. Three reasons:

  • Sorted iteration is required for range scans.
  • The total order on keys is also a stable iteration order for the cross-language test; map randomisation (Go's default) would break that.
  • Each bucket's Vec<row_id> is naturally ascending because rows are appended in row_id order, so no per-bucket sort is needed.

Planner cost model

Deliberately simple, frozen across languages:

Predicate | Estimate
Eq indexable | rows / distinct_keys
Lt/Le/Gt/Ge indexable | (rows + 2) / 3
Ne | not indexable → SeqScan
No matching index | not indexable → SeqScan

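The (rows+2)/3 is the classic "one-third selectivity" default for inequalities when no histogram is available (System R used 1/3 for open-ended comparisons, and PostgreSQL's DEFAULT_INEQ_SEL is 0.3333); the + 2 makes the integer division round up, so the estimate is ⌈rows / 3⌉. rows/distinct for equality is the expected bucket size under the uniform-distribution assumption.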
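The cost model in the table fits in one function. This is a sketch under two assumptions of mine: a missing index is represented by distinct_keys = None, and an empty index yields an estimate of 0.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Op { Eq, Ne, Lt, Le, Gt, Ge }

/// Estimated rows out of an IndexScan, or None when the predicate is not
/// indexable. `distinct_keys` is None when the column has no index.
fn estimate(op: Op, rows: usize, distinct_keys: Option<usize>) -> Option<usize> {
    let distinct = distinct_keys?; // no matching index: not indexable
    match op {
        Op::Ne => None, // Ne never uses an index
        Op::Eq => Some(if distinct == 0 { 0 } else { rows / distinct }),
        // Integer form of ceil(rows / 3).
        Op::Lt | Op::Le | Op::Gt | Op::Ge => Some((rows + 2) / 3),
    }
}
```

For a 500-row table with 100 distinct ages, an equality predicate is estimated at 5 rows and any inequality at 167.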

Tie-breaking

If two predicates produce the same estimate, the earlier one wins. This makes the choice deterministic without dragging in input-order-sensitive hashing.
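A minimal sketch of that rule, assuming the planner has already produced one Option estimate per predicate in input order (pick_scan is a hypothetical name):

```rust
/// Index of the predicate that wins the scan, given per-predicate estimates
/// in input order (None = not indexable). Returns None when nothing is
/// indexable, in which case the plan falls back to SeqScan.
fn pick_scan(estimates: &[Option<usize>]) -> Option<usize> {
    let mut best: Option<(usize, usize)> = None; // (predicate index, estimate)
    for (i, est) in estimates.iter().enumerate() {
        if let Some(e) = *est {
            // Strict `<`: an equal estimate never displaces an earlier winner.
            let better = match best {
                Some((_, b)) => e < b,
                None => true,
            };
            if better {
                best = Some((i, e));
            }
        }
    }
    best.map(|(i, _)| i)
}
```

With estimates [Some(5), Some(5)] the first predicate wins; swapping the comparison to <= would hand the scan to the second one and change the serialised plan.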

Plan shape

A Plan is always a single Pipeline. Children are, in order: one scan, zero or more Filters (the non-chosen predicates), and an optional Project. No nested pipelines, no joins. Keeps the wire format flat.

Executor model

Volcano-style pull, but materialised at each operator. With at most a few thousand rows per query, the simplicity of materialisation is worth more than the cost of allocation, and it makes the row-emission order trivial to reason about. The "true" pull pipeline is the same code in a streaming shape — the lab doesn't need that subtlety.
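A materialised pipeline of this kind might look like the sketch below. The signature is an assumption for illustration: the scan is taken as already done, and the remaining stages each consume a Vec<Row> and return a new one.

```rust
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Value { Int(i64), Text(Vec<u8>) }

type Row = Vec<Value>;

#[derive(Clone, Copy)]
enum Op { Eq, Ne, Lt, Le, Gt, Ge }

struct Pred { col: usize, op: Op, val: Value }

/// Evaluate one predicate against one row using Value's derived total order.
fn eval_pred(row: &Row, p: &Pred) -> bool {
    let lhs = &row[p.col];
    match p.op {
        Op::Eq => *lhs == p.val,
        Op::Ne => *lhs != p.val,
        Op::Lt => *lhs < p.val,
        Op::Le => *lhs <= p.val,
        Op::Gt => *lhs > p.val,
        Op::Ge => *lhs >= p.val,
    }
}

/// Materialised execution: filters run in order, then the optional projection.
/// Row order is whatever the scan produced; no stage reorders rows.
fn execute(scanned: Vec<Row>, filters: &[Pred], project: Option<&[usize]>) -> Vec<Row> {
    let mut rows = scanned;
    for f in filters {
        rows = rows.into_iter().filter(|r| eval_pred(r, f)).collect();
    }
    if let Some(cols) = project {
        rows = rows
            .iter()
            .map(|r| cols.iter().map(|&c| r[c].clone()).collect::<Row>())
            .collect();
    }
    rows
}
```

Because every stage only drops or narrows rows, the emission order is fixed entirely by the scan, which is what makes the output easy to reason about for the byte-identity test.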

Failure Modes Considered

  • Map randomisation breaks byte identity. Go's default map has a randomised iteration order. We use a parallel sorted slice; explicit sort.Slice is used everywhere a map could leak.
  • i64 / u32 endianness. Always little-endian, encoded with explicit byte slicing — never unsafe casts.
  • String collation. Text values are compared as raw bytes (std::memcmp / bytes.Compare / slice Ord), never via locale-aware comparison.
  • Wrong magic length. The result-row magic is exactly 7 bytes DSEQR01 (no NUL terminator). C++ uses std::memcmp(..., 7), never strcmp.

Execution

Tasks Performed

  1. Schema + Value + Row in all three languages. Value is a tagged union of Int(i64) and Text(Vec<u8>) with a frozen total order (Int < Text cross-kind; natural order within kind).
  2. Secondary index as BTreeMap-equivalent: std::collections::BTreeMap in Rust, std::map in C++, a parallel sorted slice in Go (since Go's maps are randomised).
  3. Planner with the cost model from analysis.md. Single pass over predicates, lowest estimate wins.
  4. Executor that scans, filters, projects in order.
  5. Wire format (dump_plan, dump_result) using only little-endian primitives so the SHA-256 lines up across all three implementations.
  6. Workload driver (qplan workload ...) that prints sha256_hex(concat(dump_plan(plan) ++ dump_result(rows))) over N ops.
  7. Tests: 10 Rust + 11 Go + 11 C++ unit tests covering the eight planner behaviours, plus dump determinism and SHA-256 known-answer vectors.
  8. Scripts: verify.sh (build + unit tests), cross_test.sh (build all three binaries, run scenarios A and B, assert sha256 identity and match the frozen golden hashes).

Order of Implementation

Rust first (the lab's reference language). Go next, debugged against the Rust hashes. C++ last, debugged against both. Each language is self-contained — no shared library, no FFI — so a divergence shows up immediately as a different cross_test.sh hash.

Pitfalls Encountered

  • Go map iteration. The first Go prototype iterated map[Value][]uint32 directly and produced a different hash on every run. Replaced with a sorted []indexEntry slice and a findKey binary search.
  • C++ std::map<Value, ...>. Works only if Value::operator< is a strict weak order across kinds; the cross-kind Int < Text rule had to match Rust's PartialOrd derivation.
  • Result magic length. The lab spec freezes the magic at 7 bytes (DSEQR01, no terminator). An early C++ port wrote 8 bytes (NUL included) and the cross-test hash diverged at the 8th byte of every result stream, the stray NUL right after the magic. Discovered by diffing the first 16 bytes of each binary's output for op 0.
  • u8 op byte for Pred. Rust's enum Op { Eq=1, ... } is #[repr(u8)]; Go and C++ mirror the constants explicitly. A missing #[repr(u8)] was the second source of byte divergence in the first iteration.

What's NOT Implemented

  • Joins of any kind.
  • Cost-based search over plan equivalence classes (Cascades).
  • Histograms / cardinality estimation beyond uniform-distribution.
  • Index updates on DELETE (rows are append-only).
  • Index merge (combining two IndexScans on different columns).
  • Persistence — see db-15 for the persistent counterpart.

Observation

Frozen golden hashes

Both scenarios produce SHA-256 hashes that are byte-identical across the Rust, Go, and C++ implementations. Both hashes are burned into scripts/cross_test.sh; scenario A's hash is additionally anchored in the Go and C++ unit-test suites (test 11).

ID | Args | sha256
A | --seed 42 --ops 200 --rows 500 --scenario mixed | 3918bc6eca225f1c9c004fdcefa6551788282a4a2223fa98b002e8b54eb74a2e
B | --seed 7 --ops 500 --rows 2000 --scenario indexheavy | 9313fe694db38912a814abc16600d82f82ead7fc053e813af4bb3978c8fa9abd

If either hash changes, the wire format has drifted: CONCEPTS.md section 9, scripts/cross_test.sh, and every test that anchors a hash must be updated in lockstep.

Byte walkthrough — first op of scenario A

Scenario A drives 200 ops against a 500-row table with an index on column 2 (age). The first op draws (r3, r4, r5) from SplitMix64(42), which gives kind = (r3 >> 60) & 3 = 0 ⇒ EQ on col = r4 % 3 with value pick_val_for(col, r5, 500).

Concretely the first op produces a Plan of:

Pipeline                       0x05 0x01 0x00 0x00 0x00     // 1 child
  IndexScan col=2 op=Eq Int(v) 0x02 0x02 0x00 0x00 0x00     // col_idx=2
                               0x01                          // Op::Eq
                               0x01                          // VTAG_INT
                               <i64 LE v>                    // 8 bytes

That is 20 bytes total for the plan dump: 5 for the Pipeline header and 15 for the IndexScan node. The result row stream is:

"DSEQR01"                  0x44 0x53 0x45 0x51 0x52 0x30 0x31
u32 LE row_count           ....
per row: u32 col_count=3 | (tag|val) * 3

The 7-byte magic is deliberate — the length is part of the byte-identity contract.
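The result half of the stream can be sketched the same way. The layout follows the wire format in CONCEPTS.md section 9; the names MAGIC and dump_result's signature are assumptions for this illustration.

```rust
enum Value { Int(i64), Text(Vec<u8>) }

/// Exactly 7 bytes; a byte-string literal carries no NUL terminator.
const MAGIC: &[u8; 7] = b"DSEQR01";

/// "DSEQR01" | u32 LE row_count | per row: u32 LE col_count | (u8 tag | val)*
fn dump_result(rows: &[Vec<Value>]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(MAGIC);
    out.extend_from_slice(&(rows.len() as u32).to_le_bytes());
    for row in rows {
        out.extend_from_slice(&(row.len() as u32).to_le_bytes());
        for v in row {
            match v {
                Value::Int(i) => {
                    out.push(1); // Int tag, then i64 LE
                    out.extend_from_slice(&i.to_le_bytes());
                }
                Value::Text(b) => {
                    out.push(2); // Text tag, then u32 LE length, then raw bytes
                    out.extend_from_slice(&(b.len() as u32).to_le_bytes());
                    out.extend_from_slice(b);
                }
            }
        }
    }
    out
}
```

An empty result is 11 bytes (7 for the magic plus 4 for the row count). A single row (Int, 2-byte Text, Int) adds 4 + 9 + 7 + 9 bytes for a total of 40, which is a convenient size to diff by hand when two implementations disagree.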

Per-scenario telemetry

scenario A — mixed, 200 ops, 500 rows, 1 index (col=age)
  plan-kind distribution (theoretical, from (r3>>60)&3):
    EQ      ~50 %     (kinds 0 and 1 both map to EQ, ~25 % each)
    range   ~25 %
    project ~25 %
  → planner emits IndexScan when the predicate is on col 2 and op != Ne
    (col = r4 % 3, so ≈ 1/3 of the EQ and range ops, ≈ 25 % of all ops);
    otherwise SeqScan + Filter.

scenario B — indexheavy, 500 ops, 2000 rows, 2 indexes (col 1 and col 2)
  index coverage rises from ~33 % of predicates → ~66 %; the planner picks
  IndexScan on the smaller-bucket column ~3 × more often, dropping the
  total emitted-row count by an order of magnitude versus scanonly.

Unit test counts

rust  10 tests   cargo test --release --quiet
go    11 tests   go test ./...
cpp   11 tests   ctest --output-on-failure

All three suites cover the same eight planner behaviours plus dump determinism and SHA-256 known-answer vectors. Test 11 in Go and C++ anchors scenario A's hash directly so any drift fails at go test / ctest time, not just at cross_test.sh time.

Verification

What we verify

  1. Single-language correctness. Each language has a test suite that covers the eight planner behaviours (insert layout, index bucket ordering, EQ → IndexScan, range → IndexScan, NE → SeqScan+Filter, projection-only collapse, deterministic row emission order, two-predicate selection of the most selective).
  2. Determinism within a language. run_workload(cfg) is pure; the same cfg produces identical bytes (test 10 in each suite).
  3. Cross-language byte identity. cross_test.sh builds all three qplan binaries, runs scenarios A and B, asserts SHA-256 equality across the three outputs and equality with the frozen reference hashes (3918bc6e… and 9313fe69…).
  4. Sha256 implementation correctness. Rust and C++ ship their own SHA-256; the empty-string and "abc" known-answer vectors are checked in each unit-test suite. Go uses stdlib crypto/sha256.

How to run

bash scripts/verify.sh       # → "=== OK ==="
bash scripts/cross_test.sh   # → "=== ALL OK ==="

verify.sh runs cargo test, go test, and ctest in turn; any failure aborts with set -euo pipefail. cross_test.sh exits 1 on the first mismatch or drift from the frozen golden hash.

Hand-checks before changing wire format

If dump_plan or dump_result is touched intentionally, the workflow is:

  1. Update both functions in all three languages in the same commit.
  2. Run cross_test.sh — the outputs across languages must still match.
  3. Capture the new SHA-256 for scenarios A and B.
  4. Update the want_hash A / want_hash B lines in cross_test.sh.
  5. Update the test-11 anchor strings in src/go/idx14_test.go and src/cpp/tests/test_idx14.cc.
  6. Update the hash table in docs/observation.md and the wire-format section in CONCEPTS.md.

Skipping any step makes a future "did the wire format silently drift?" audit unreliable.

What we deliberately do NOT verify

  • Performance. db-22 owns benchmarking; this lab targets correctness only.
  • Concurrency. The structures are not thread-safe by design.
  • Large inputs. Scenarios A and B are sized so cross_test.sh finishes in well under a second on a laptop; the byte-identity property is size-invariant.

Broader Ideas

Beyond this lab

Cost-based optimisation

The cost model here estimates a single number per predicate. A real optimiser must:

  • Estimate cardinality and CPU/IO cost per operator.
  • Compose costs across operators (a Filter after an IndexScan costs scan-rows × predicate-eval-cost).
  • Handle join ordering (System R / Selinger DP).
  • Handle correlated predicates (x = 1 AND y = 1 where x and y are correlated — uniform-independence is the standard wrong assumption).

See Leis et al. 2015 — bad cardinality estimates dominate bad cost models.

Move from a single-pass rule-based planner to a search over plan-space:

  • Represent equivalent plan trees with a memo table.
  • Apply transformation rules (push filter below scan, merge filters).
  • Score each candidate; pick lowest-cost.
  • This is what SQL Server, CockroachDB, Apache Calcite do.

Index variants

  • Hash index — O(1) point lookup, no range.
  • Bitmap index — efficient AND/OR of many low-cardinality predicates; great for analytics, bad for OLTP.
  • Covering index — include extra columns to skip the heap read entirely (index-only scan).
  • Partial index: a WHERE x > 100 predicate baked into the index; smaller, but only usable when the predicate is implied by the query.
  • Functional index — index on lower(name) rather than name.
  • Multi-column index — order matters; left-most prefix rule.

Statistics

A real optimiser maintains:

  • Histograms (equi-depth, equi-width, or compressed) per column for range selectivity.
  • NDV (number of distinct values) per column for equality selectivity.
  • Correlation metrics between column pairs.
  • MCVs (most common values) for skewed distributions.

These need to be refreshed periodically: ANALYZE in PostgreSQL, and ANALYZE in SQLite, which stores its results in the sqlite_stat1 table.

Adaptive query execution

  • Spark / Snowflake re-plan at operator boundaries based on observed row counts.
  • PostgreSQL has parallel-plan re-decisions; Vertica re-optimises mid-query.

Hardware angles

  • Pointer chasing through an in-memory B-tree is bound by L2 cache misses. Cache-oblivious B-trees and trie-based indexes (ART, HOT) reduce that.
  • SSDs keep random index reads competitive with a sequential scan up to surprisingly high selectivity (~10 % of the table).
  • GPUs and vector instructions favour columnar storage + vectorised scans over row-at-a-time indexing.

What I'd build next

  • Add a second index type, a hash index, and let the planner compare estimates across index families (O(1) hash beats O(log N) B-tree on pure equality, ties broken by index size).
  • Add a nestedloop_join operator and extend the cost model so the planner picks build vs probe side.
  • Add a tiny ANALYZE step that snapshots distinct_keys and lets Planner::estimate consult cached stats instead of walking the index each call.

Step 01 — Secondary Index

Goal

Build the Table structure: rows, schema, and a BTreeMap-equivalent secondary index per indexed column.

Why

A secondary index is the smallest, most universal unit of query optimisation. Once you can map column-value → row-ids in sorted order, equality and range predicates become dictionary operations, and the planner has a real choice to make.

What to do

  1. Pick a tagged-union Value type: Int(i64) or Text(bytes). Implement a frozen total order: Int < Text across kinds; natural order within each kind. This is the byte-identity contract.

  2. Define Column { name, type } and Row = Vec<Value>.

  3. Implement Table::new(schema), Table::insert(row), Table::create_index(col_idx). The index is keyed by Value and maps to an ascending Vec<row_id>.

  4. Decide the physical form per language:

    • Rust: BTreeMap<Value, Vec<u32>>.
    • C++ : std::map<Value, std::vector<std::uint32_t>>.
    • Go : sorted slice of (Value, []uint32) + binary search. Never iterate a Go map for cross-language tests — the order is randomised.
  5. insert must update every existing index before moving the row. create_index must iterate rows in order (so each bucket's row-id list is naturally ascending without per-bucket sort).
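The steps above can be sketched in Rust roughly as follows. This is a simplified illustration, not the lab's source: the schema is omitted, Table::default() stands in for Table::new(schema), and lookup_eq is a hypothetical helper added only to make the result observable.

```rust
use std::collections::BTreeMap;

/// Variant order gives the frozen cross-kind rule: every Int sorts before
/// every Text. Vec<u8> compares as raw bytes, never by locale.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Value { Int(i64), Text(Vec<u8>) }

type Row = Vec<Value>;

#[derive(Default)]
struct Table {
    rows: Vec<Row>,
    /// col_idx -> (value -> ascending row-ids)
    indexes: BTreeMap<usize, BTreeMap<Value, Vec<u32>>>,
}

impl Table {
    /// Append a row; every existing index is updated before the row is moved.
    fn insert(&mut self, row: Row) -> u32 {
        let row_id = self.rows.len() as u32;
        for (col, index) in self.indexes.iter_mut() {
            index.entry(row[*col].clone()).or_default().push(row_id);
        }
        self.rows.push(row);
        row_id
    }

    /// Build an index over existing rows in row-id order, so each bucket is
    /// ascending without a per-bucket sort.
    fn create_index(&mut self, col: usize) {
        let mut index: BTreeMap<Value, Vec<u32>> = BTreeMap::new();
        for (row_id, row) in self.rows.iter().enumerate() {
            index.entry(row[col].clone()).or_default().push(row_id as u32);
        }
        self.indexes.insert(col, index);
    }

    /// Hypothetical helper: row-ids for `col = v`, empty if no index or no match.
    fn lookup_eq(&self, col: usize, v: &Value) -> Vec<u32> {
        self.indexes
            .get(&col)
            .and_then(|ix| ix.get(v))
            .cloned()
            .unwrap_or_default()
    }
}
```

Buckets stay ascending whether a row arrives before or after create_index, because both paths append in row-id order.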

Acceptance

  • Inserting the same rows in the same order produces the same index contents byte-for-byte across languages.
  • An EQ lookup on an indexed column returns rows in ascending row-id order.
  • A range lookup walks the index in ascending key, ascending row-id within each bucket.

Common pitfalls

  • Storing the wrong column in the bucket (off-by-one on col_idx).
  • Copying the Value lazily and then losing the bytes when the row moves — clone before insert.
  • Letting Go's range over map leak into any code path that touches the index — every iteration must be over the sorted slice.

Step 02 — Rule-Based Planner + Executor

Goal

Turn a Query { projections, predicates } into a Plan { Pipeline[scan, *Filter, Project?] }, then execute it.

Why

Once a table has indexes, every predicate has at least two physical implementations: scan the whole heap and filter, or jump into the index. Picking the right one is what a planner does. Even a 10-line rule-based planner outperforms "always SeqScan" by orders of magnitude on selective queries, and it teaches the same vocabulary (selectivity, access path, predicate pushdown) you need for cost-based work later.

What to do

  1. Estimate per predicate (return None if not indexable):

    • Op::Ne → not indexable.
    • No index on the column → not indexable.
    • Op::Eq → rows / distinct_keys.
    • Op::Lt / Le / Gt / Ge → (rows + 2) / 3.
  2. Pick the lowest estimate (strict <, so earlier predicate wins ties — determinism matters).

  3. Build the Plan:

    • First child: IndexScan{col, op, val} if a predicate was chosen, else SeqScan{table_id: 0}.
    • Remaining predicates become Filter nodes in input order.
    • If projections is non-empty, append Project{cols}; sort and dedupe the column list so the wire format is canonical.
  4. Execute the Plan (Volcano-style, but materialise at each operator for simplicity):

    • Scan emits all matching rows in (key, row-id) order.
    • Filter retains rows where eval_pred(row, predicate) is true.
    • Project rewrites each row to the requested column subset.
  5. IndexScan for each op:

    • Eq → fetch the single bucket (or empty).
    • Lt → iterate range(..val).
    • Le → iterate range(..=val).
    • Gt → iterate range(val+ε..), i.e. an exclusive lower bound: (Bound::Excluded(val), Bound::Unbounded).
    • Ge → iterate range(val..).
    • Within each bucket, emit row-ids in ascending order.
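The per-op ranges in step 5 map directly onto std::ops::Bound. The sketch below uses i64 keys for brevity, which is a simplification of mine; the lab's index is keyed by Value, and Ne is left out because it never reaches an IndexScan.

```rust
use std::collections::BTreeMap;
use std::ops::Bound::{Excluded, Included, Unbounded};

#[derive(Clone, Copy)]
enum Op { Eq, Lt, Le, Gt, Ge }

/// Row-ids matching `key <op> val`, in ascending key order and, within each
/// bucket, in the bucket's stored (ascending) row-id order.
fn index_scan(index: &BTreeMap<i64, Vec<u32>>, op: Op, val: i64) -> Vec<u32> {
    let bounds = match op {
        Op::Eq => (Included(val), Included(val)), // the single bucket, if present
        Op::Lt => (Unbounded, Excluded(val)),
        Op::Le => (Unbounded, Included(val)),
        Op::Gt => (Excluded(val), Unbounded), // "val + ε" as an exclusive bound
        Op::Ge => (Included(val), Unbounded),
    };
    index
        .range(bounds)
        .flat_map(|(_, ids)| ids.iter().copied())
        .collect()
}
```

Expressing Gt as an exclusive bound avoids computing a successor value, which has no obvious definition for Text keys.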

Acceptance

  • Given an indexed column and an Op::Eq predicate, the planner emits IndexScan, and the executor returns the matching rows in row-id order.
  • Given a non-indexable predicate (Op::Ne, or no index), the planner emits SeqScan and a Filter.
  • Given two predicates, the planner picks the one with the lower estimate; the other becomes a Filter.
  • Projections with duplicates (e.g. [2, 0, 2]) end up as [0, 2].
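Plan assembly, including the canonical projection, can be sketched as below. The Pred and Node shapes are simplified assumptions (op and value are plain integers here), and the index of the winning predicate is taken as an input rather than computed.

```rust
#[derive(Clone, Debug, PartialEq)]
struct Pred { col: u32, op: u8, val: i64 }

#[derive(Debug, PartialEq)]
enum Node {
    SeqScan { table_id: u32 },
    IndexScan(Pred),
    Filter(Pred),
    Project { cols: Vec<u32> },
}

/// Build the flat child list: one scan, the non-chosen predicates as Filters
/// in input order, then an optional Project. `chosen` is the index of the
/// predicate that won the scan, if any.
fn build_plan(preds: &[Pred], chosen: Option<usize>, projections: &[u32]) -> Vec<Node> {
    let mut children = Vec::new();
    match chosen {
        Some(i) => children.push(Node::IndexScan(preds[i].clone())),
        None => children.push(Node::SeqScan { table_id: 0 }),
    }
    for (i, p) in preds.iter().enumerate() {
        if Some(i) != chosen {
            children.push(Node::Filter(p.clone()));
        }
    }
    if !projections.is_empty() {
        // Work on a copy so the caller's query is never mutated.
        let mut cols = projections.to_vec();
        cols.sort_unstable();
        cols.dedup();
        children.push(Node::Project { cols });
    }
    children
}
```

With projections [2, 0, 2] the Project node carries [0, 2], matching the last acceptance bullet, and an empty projection list produces no Project node at all.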

Common pitfalls

  • Forgetting to clone the predicate value when moving it into IndexScan — both the chosen and discarded predicates need a copy.
  • Using <= instead of < for tie-breaking: <= lets the later predicate win a tie, which breaks the frozen "earlier wins" rule and changes the plan bytes.
  • Returning rows from Le (<=) by stopping at the first key greater than or equal to val instead of the first key strictly greater, which drops the val bucket (off-by-one on bounds).
  • Mutating the input Query.projections instead of cloning before sort / dedup.

Step 03 — Cross-Language Byte Identity

Goal

Make Rust, Go, and C++ produce byte-identical dump_plan(plan) ++ dump_result(rows) streams for the same workload, and verify it with SHA-256.

Why

If three implementations of the same spec produce the same bytes on a randomly-seeded workload, they agree on every observable behaviour — plan choice, operator order, row emission order, value encoding. A single divergent byte is the difference between "we have a spec" and "we have three programs that happen to look similar".

What to do

  1. Freeze the wire format in CONCEPTS.md section 9. Plan tags 0x01..0x05, op codes 0x01..0x06, val tags 0x01..0x02, result magic "DSEQR01" (7 bytes, no terminator).

  2. Implement dump_plan / dump_result in each language. Use only little-endian primitives — never platform-native byte order. C++: std::memcpy of to_le_bytes-equivalent expressions; never reinterpret int*. Go: binary.LittleEndian.PutUint32 / PutUint64. Rust: to_le_bytes().

  3. Implement RunWorkload identically:

    • SplitMix64(seed) with the canonical constants 0x9E3779B97F4A7C15, 0xBF58476D1CE4E7B5, 0x94D049BB133111EB.
    • For each row i in 0..rows: draw r1, r2; insert (IntV(i), Text("n" + (r1 % 1000)), IntV(r2 % 100)).
    • After insertion, apply the scenario's indexes (none / col 2 / cols 2+1).
    • For each op: draw r3, r4, r5; derive kind = (r3 >> 60) & 3, col = r4 % 3. Build the query per kind (0/1 → EQ, 2 → range with op = ((r5>>1)&1) ? Lt : Gt, 3 → projection-only).
    • Plan, execute, append dump_plan ++ dump_result to the rolling output.
  4. CLI: qplan workload --seed S --ops N --rows R --scenario X prints sha256_hex of the rolling output with no trailing newline.

  5. Compare: scripts/cross_test.sh runs both scenarios across all three binaries and asserts the three hashes match each other and the frozen golden hashes.
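The generator and the per-op derivation from step 3 can be sketched as follows. The SplitMix64 body is the commonly published form using the three constants listed above; whether the lab advances the state before or after mixing is not stated here, so treat that ordering, and the Kind type, as assumptions of this sketch.

```rust
/// SplitMix64 in its commonly published form (add the increment, then mix).
struct SplitMix64 { state: u64 }

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        SplitMix64 { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        // All arithmetic wraps modulo 2^64, as in the C reference.
        self.state = self.state.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E7B5);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical summary of what an op will do.
#[derive(Debug, PartialEq)]
enum Kind { Eq, Range { lt: bool }, ProjectOnly }

/// Derive (kind, col) from the three draws of one op:
/// kinds 0 and 1 are EQ, 2 is a range (Lt if bit 1 of r5 is set, else Gt),
/// 3 is projection-only.
fn derive_op(r3: u64, r4: u64, r5: u64) -> (Kind, u64) {
    let col = r4 % 3;
    let kind = match (r3 >> 60) & 3 {
        0 | 1 => Kind::Eq,
        2 => Kind::Range { lt: (r5 >> 1) & 1 == 1 },
        _ => Kind::ProjectOnly,
    };
    (kind, col)
}
```

Using wrapping_add and wrapping_mul matters in Rust: plain + and * panic on overflow in debug builds, which would make the debug and release binaries behave differently.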

Acceptance

  • scripts/verify.sh ends with === OK === (unit tests pass in all three languages).
  • scripts/cross_test.sh ends with === ALL OK === (cross-language bytes match; golden hashes match).
  • Anchor tests (test 11 in Go and C++) verify scenario A's SHA-256 at unit-test time, so drift is caught even without running cross_test.sh.

Common pitfalls

  • Trailing newline from println! / fmt.Println / std::cout << std::endl will change the binary's stdout. Use write_all / Write / fwrite and flush.
  • Magic length. Writing "DSEQR01\0" (8 bytes) instead of 7 makes every op-boundary off by one. The byte-walkthrough in docs/observation.md is the canonical reference if in doubt.
  • Map iteration order in Go. Use sorted slices for any structure whose iteration order ends up in the wire bytes.
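  • Implicit discriminants on Rust enums. op as u8 yields the variant's discriminant, which starts at 0 unless every variant is given an explicit value (Eq = 1, ...). #[repr(u8)] pins the representation to one byte but does not choose the values, so declare both.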
  • bool packing. Some C++ standard-library std::vector<bool> paths are surprising; never put a bool in the wire format — promote to std::uint8_t.
  • SHA-256 final byte ordering. The output is big-endian per word; hex-encoding mistakes swap nibbles. The empty-string known answer (e3b0c442...) catches this immediately.

db-15 — SQLite-shaped engine, end-to-end

Where this sits

This lab is the capstone for the SQLite-style track. Earlier labs (db-10 .. db-14) build the parts in isolation: B-tree (db-10), pager (db-11), SQL frontend (db-12), MVCC transactions (db-13), indexes (db-14). Here we fuse a deliberately small slice of all of them into one engine and prove the slice is reproducible across Rust, Go, and C++ down to the byte.

The goal is not feature parity with real SQLite — that would dwarf the lab. The goal is to exhibit, in code small enough to keep in your head, the join between:

  • a primary index keyed by integer,
  • a secondary index keyed by text,
  • an MVCC tombstone scheme governed by a monotonic transaction id,
  • a deterministic snapshot wire format that any of the three reference implementations can produce identically.

Data model

A single table:

kv(k INT primary key, v INT, tag TEXT)

Physical row:

Row { v: i64, tag: String, created_at: u64, deleted_at: u64 }

deleted_at == 0 means the row is live; anything else is the txid at which it was tombstoned. Tombstoned rows stay in the primary map (they appear in the snapshot dump so a verifier can audit historical state) but they disappear from the secondary index immediately on delete.

In-memory layout:

  • primary: ordered map<i64 -> Row> — sorted by key. Holds tombstones.
  • secondary: ordered map<String -> sorted Vec<i64>>, live rows only. Each list is kept strictly ascending.

A single monotonically-increasing next_txid (starts at 1) governs visibility. Read-only SELECT never bumps it. Write ops bump only when they actually mutate state.
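
A minimal Rust sketch of this layout, assuming std's BTreeMap for both ordered maps. Row, Conn and select_by_k follow the names used in the lab's steps; is_live is a hypothetical helper introduced only for this illustration.

```rust
use std::collections::BTreeMap;

/// Physical row. `deleted_at == 0` means the row is live.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub v: i64,
    pub tag: String,
    pub created_at: u64,
    pub deleted_at: u64,
}

impl Row {
    /// Hypothetical helper: a row is visible while it carries no tombstone txid.
    pub fn is_live(&self) -> bool {
        self.deleted_at == 0
    }
}

pub struct Conn {
    pub next_txid: u64,
    /// Sorted by key; holds tombstones as well as live rows.
    pub primary: BTreeMap<i64, Row>,
    /// Live rows only; each key list is kept strictly ascending.
    pub secondary: BTreeMap<String, Vec<i64>>,
}

impl Conn {
    pub fn new() -> Self {
        Conn { next_txid: 1, primary: BTreeMap::new(), secondary: BTreeMap::new() }
    }

    /// Read-only lookup: never touches next_txid and hides tombstoned rows.
    pub fn select_by_k(&self, k: i64) -> Option<&Row> {
        self.primary.get(&k).filter(|r| r.is_live())
    }
}
```

Because the tombstone stays in primary, a lookup has to filter on deleted_at rather than rely on the key being absent.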

SQL surface

Only four ops, deliberately:

| op | semantics | txid bump? |
| --- | --- | --- |
| INSERT(k, v, tag) | UPSERT: replaces an existing row (even a tombstoned one) with a fresh row whose created_at is the new txid. | always |
| UPDATE(k, v, tag) | live-only. If the row is missing or tombstoned, returns false and does not bump txid. Keeps original created_at. Maintains the secondary index across tag changes. | only if work was done |
| DELETE(k) | live-only. Marks deleted_at = txid, removes the row from the secondary index. | only if work was done |
| SELECT_BY_K(k) / SELECT_BY_TAG(tag) | read-only. | never |

The semantic gotcha for cross-language identity is the no-op rule on UPDATE and DELETE. If any implementation bumps txid on a missing key, every subsequent created_at / deleted_at will drift and the snapshot diverges.
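
The no-op rule can be sketched in Rust as follows. This is an illustration under stated assumptions, not the lab's source: it assumes a write stamps the current next_txid and then advances it, it omits the secondary index to stay short, and Engine and bump are hypothetical names.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
pub struct Row { pub v: i64, pub tag: String, pub created_at: u64, pub deleted_at: u64 }

pub struct Engine { pub next_txid: u64, pub primary: BTreeMap<i64, Row> }

impl Engine {
    pub fn new() -> Self { Engine { next_txid: 1, primary: BTreeMap::new() } }

    /// Assumption: a write uses the current next_txid, then advances it.
    fn bump(&mut self) -> u64 {
        let t = self.next_txid;
        self.next_txid += 1;
        t
    }

    fn is_live(&self, k: i64) -> bool {
        matches!(self.primary.get(&k), Some(r) if r.deleted_at == 0)
    }

    /// UPSERT: always bumps, even when it replaces a tombstone.
    pub fn exec_insert(&mut self, k: i64, v: i64, tag: &str) {
        let t = self.bump();
        self.primary.insert(k, Row { v, tag: tag.to_string(), created_at: t, deleted_at: 0 });
    }

    /// Live-only: on a missing or tombstoned key, return false and leave next_txid alone.
    pub fn exec_update(&mut self, k: i64, v: i64, tag: &str) -> bool {
        if !self.is_live(k) { return false; }
        self.bump();
        let row = self.primary.get_mut(&k).expect("liveness checked above");
        row.v = v;
        row.tag = tag.to_string(); // created_at is deliberately left untouched
        true
    }

    /// Live-only: the row stays in the primary map with deleted_at set.
    pub fn exec_delete(&mut self, k: i64) -> bool {
        if !self.is_live(k) { return false; }
        let t = self.bump();
        self.primary.get_mut(&k).expect("liveness checked above").deleted_at = t;
        true
    }
}
```

The liveness check comes before bump in both live-only ops; reversing the two is exactly the drift described above.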

Snapshot wire format

Magic = "DSESQL15" (8 bytes, ASCII).

magic[8] || next_txid:u64 LE || primary_row_count:u32 LE
for each k in ascending order:
    k:i64 LE
    v:i64 LE
    tag_len:u32 LE
    tag_bytes
    created_at:u64 LE
    deleted_at:u64 LE
secondary_distinct_keys:u32 LE
for each tag in ascending lexicographic order:
    tag_len:u32 LE
    tag_bytes
    key_count:u32 LE
    for each key in ascending order: i64 LE

Three properties this format is built for:

  1. Total order at every level. Both the primary and secondary sections iterate in a sort order that is well-defined regardless of the host hash map (a real bug we hit in early Go drafts: map iteration is randomised, so a for k, v := range without an explicit sort produces a different byte stream on every run).
  2. Tombstones are observable. Including tombstones in the primary dump means the snapshot reflects the visibility scheme, not just the live set — useful when comparing two implementations' MVCC behaviour.
  3. Self-delimiting. Every variable-length string is preceded by its length, so a parser does not have to guess.
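
A sketch of a serializer for this layout in Rust, assuming both maps are BTreeMaps so that plain iteration already yields the required order. The function name dump_snapshot comes from the lab's steps; put_str is a hypothetical helper, and passing the three fields as separate arguments is a simplification for the example.

```rust
use std::collections::BTreeMap;

pub struct Row { pub v: i64, pub tag: String, pub created_at: u64, pub deleted_at: u64 }

/// Hypothetical helper: u32 LE length prefix followed by the raw bytes.
fn put_str(out: &mut Vec<u8>, s: &str) {
    out.extend_from_slice(&(s.len() as u32).to_le_bytes());
    out.extend_from_slice(s.as_bytes());
}

pub fn dump_snapshot(
    next_txid: u64,
    primary: &BTreeMap<i64, Row>,
    secondary: &BTreeMap<String, Vec<i64>>,
) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(b"DSESQL15");
    out.extend_from_slice(&next_txid.to_le_bytes());
    out.extend_from_slice(&(primary.len() as u32).to_le_bytes());
    // BTreeMap iterates in ascending key order; tombstones are included.
    for (k, row) in primary {
        out.extend_from_slice(&k.to_le_bytes());
        out.extend_from_slice(&row.v.to_le_bytes());
        put_str(&mut out, &row.tag);
        out.extend_from_slice(&row.created_at.to_le_bytes());
        out.extend_from_slice(&row.deleted_at.to_le_bytes());
    }
    out.extend_from_slice(&(secondary.len() as u32).to_le_bytes());
    // String keys compare bytewise, which this sketch takes as "lexicographic".
    for (tag, keys) in secondary {
        put_str(&mut out, tag);
        out.extend_from_slice(&(keys.len() as u32).to_le_bytes());
        for k in keys {
            out.extend_from_slice(&k.to_le_bytes());
        }
    }
    out
}
```

An empty engine serializes to 24 bytes (8 magic, 8 next_txid, 4 row count, 4 tag count). Under bytewise ordering "t10" sorts before "t2", so a port that sorts tags numerically would diverge; that reading of the ordering rule is an assumption of the sketch.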

Deterministic workload

run_workload(seed, ops, keys, scenario) is the only entry point used in cross-language testing. It draws three 64-bit words per op from a splitmix64 seeded with seed:

r1, r2, r3 = rng.next(), rng.next(), rng.next()
kind = (r1 >> 60) & 0x7
k    = (i64) (r2 % keys)
v    = (i64) (r3 % 10_000)
tag  = "t" + ((r3 >> 32) % 16)
match kind {
    0,1,2 => INSERT(k, v, tag)   // 3/8 of ops
    3,4   => UPDATE(k, v, tag)   // 2/8
    5     => DELETE(k)           // 1/8
    6     => SELECT_BY_K(k)      // 1/8
    7     => SELECT_BY_TAG(tag)  // 1/8
}

Two non-obvious rules:

  • Reads still draw all three rng words. Even though SELECT_BY_K only needs k, it still draws r3. If one implementation skipped the draw, its rng stream would shift for every subsequent op and its snapshot would diverge from the other two.
  • tag = "t" + decimal(n). Decimal string formatting, not hex — trivially easy to get wrong in C++ where std::ostringstream << std::hex is the default reflex.
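
The decoding rules above, written as a pure Rust function over three already-drawn words. The Op enum and the decode_op name are hypothetical; the arithmetic is taken directly from the pseudocode. Taking r1, r2, r3 as arguments makes the "always draw three words" rule the caller's responsibility and keeps the decoder testable without an RNG.

```rust
#[derive(Debug, PartialEq)]
pub enum Op {
    Insert { k: i64, v: i64, tag: String },
    Update { k: i64, v: i64, tag: String },
    Delete { k: i64 },
    SelectByK { k: i64 },
    SelectByTag { tag: String },
}

/// Decode one workload op from three RNG words.
/// The caller draws r1, r2, r3 unconditionally, so reads consume as many words as writes.
pub fn decode_op(r1: u64, r2: u64, r3: u64, keys: u64) -> Op {
    let kind = (r1 >> 60) & 0x7;
    let k = (r2 % keys) as i64; // unsigned modulo first, then cast
    let v = (r3 % 10_000) as i64;
    let tag = format!("t{}", (r3 >> 32) % 16); // base-10, no padding
    match kind {
        0..=2 => Op::Insert { k, v, tag },
        3 | 4 => Op::Update { k, v, tag },
        5 => Op::Delete { k },
        6 => Op::SelectByK { k },
        _ => Op::SelectByTag { tag }, // kind == 7
    }
}
```

Note that (r1 >> 60) & 0x7 keeps bits 60 to 62 and discards bit 63, so a word with all four top bits set still decodes to kind 7.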

Frozen golden hashes

Captured from the Rust release build. The cross-language test asserts these byte for byte.

| Scenario | CLI args | SHA-256 |
| --- | --- | --- |
| A | --seed 42 --ops 500 --keys 32 --scenario default | e8ccacd39d8535c1ed101f0bc8b7a0799f56468a384da9284d4768cd8b3a3aab |
| B | --seed 7 --ops 2000 --keys 128 --scenario default | dd1d6bb7fec1ffc9f71f01e75a58166b04517a669495af2aa2da432d4722db69 |

Sources of cross-language divergence

A non-exhaustive checklist that we hit while building the three ports:

  • Map iteration order. Go map iteration is randomised. Always collect keys then sort.Strings/sort.Slice before any side-effecting iteration that contributes to the wire stream.
  • Signed vs unsigned k. r2 % keys is unsigned modular arithmetic in all three languages; we then cast to i64. A cast through a 32-bit int (C++ int on most platforms, Go int on 32-bit targets) would lose bits once the value no longer fits in 31 bits. C++ uses static_cast<int64_t>, Rust as i64, Go int64(...).
  • Tag formatting. Use base-10 only. Padding, hex, or uppercase would all change the bytes silently.
  • Splitmix64 constants. All three implementations use the same triple: 0x9E3779B97F4A7C15, 0xBF58476D1CE4E7B5, 0x94D049BB133111EB. In C++, keep the constants and the generator state in std::uint64_t and add the ULL suffix so the width is explicit; a constant or state that ends up in a 32-bit type is truncated and produces a different stream.
  • SHA-256. Rust uses sha2, Go uses crypto/sha256, C++ ships an inline reference implementation in src/cpp/src/sql15.cc. A canonical test vector (SHA256("abc")) is asserted in every test suite to catch a broken implementation before it pollutes a scenario hash.
  • No trailing newline from the CLI. The shell-level test compares "$RUST_BIN ..." with "$GO_BIN ..." as raw strings; an extra \n from one of them silently fails the equality. Rust uses print!, Go uses fmt.Print, C++ uses std::cout << ... with no << endl.

What this lab does not model

Listed up front so the reader does not look for them:

  • No on-disk persistence, no WAL, no pager. The "snapshot" is an in-memory byte stream produced on demand.
  • No concurrent transactions. MVCC visibility rules are implemented, but there is only one writer.
  • No query planner; SELECT_BY_K and SELECT_BY_TAG are direct map lookups.
  • No DDL. The schema is hard-coded.

Those are deliberately deferred to db-21 and the capstone (db-23).

References

Books

  • Sippu, S., & Soisalon-Soininen, E. (2015). Transaction Processing: Management of the Logical Database and its Underlying Physical Structure. Springer. Chapter 6 ("Logical Database Updates") gives the cleanest treatment of the no-op-update / no-op-delete rule that governs txid allocation here.
  • Bernstein, P. A., Hadzilacos, V., & Goodman, N. (1987). Concurrency Control and Recovery in Database Systems. Addison-Wesley. Chapter 5 on multiversion concurrency control is the source of the "tombstone with deleted-at txid" representation we use.
  • Hellerstein, J. M., Stonebraker, M., & Hamilton, J. (2007). Architecture of a Database System. Foundations and Trends in Databases, 1(2). Provides the layering vocabulary (storage manager, access methods, query processor) we slice through here.

Papers

  • Reed, D. P. (1978). Naming and Synchronization in a Decentralized Computer System. MIT/LCS/TR-205. The original MVCC paper.
  • Bernstein, P. A., & Goodman, N. (1981). Concurrency Control in Distributed Database Systems. ACM Computing Surveys 13(2). Lays out the timestamp-ordering protocols that motivate our monotonic txid.

Source documentation

Cross-language byte-identity practice

  • Google's protobuf encoding guide: https://protobuf.dev/programming-guides/encoding/. It warns that serialised output is not guaranteed to be byte-stable, which is the reason to sort map entries explicitly before serialisation.
  • CBOR deterministic encoding (RFC 8949 §4.2). Same idea applied to a different format. Useful background for why we sort the secondary index lexicographically rather than by insertion order.

Earlier labs in this workspace

analysis

The shape of the problem

We want the smallest engine that still demonstrably integrates the five things the SQLite-track labs build separately: a primary keyed container, a secondary index, an MVCC visibility scheme, a SQL surface, and a reproducible on-the-wire snapshot. "Smallest" here means: any feature we cut must be a feature that other labs already cover or labs after this will cover (db-21, db-23).

Three forces pull on the design:

  1. It has to be correct in three languages at once. Cross-language byte identity is the cheap, mechanical proof that the implementations agree. Anything that varies between language runtimes (hash map ordering, string formatting, integer width, signedness on casts) becomes a hazard.
  2. It has to be small enough to keep in your head. The whole engine is ~400 lines per language. That budget forced us to drop the pager, the on-disk format, and any kind of query planner.
  3. It has to actually test the integration. A no-op UPDATE that silently bumps the txid slips past any single-language unit suite that lacks an explicit missing-key test; the cross-language hash comparison exposes it as soon as one implementation disagrees with the others.

Why MVCC over locking

A locking implementation would have been smaller, but it would not have produced a visible artefact for the snapshot. With MVCC we have the row-level created_at / deleted_at pair as observable state, and the snapshot dump can carry it. That gives us something to compare.

Why a secondary index

Without one, the snapshot would be just a sorted map dump and the cross-language test would degenerate into "do all three languages sort ints the same way" (trivially yes). The secondary forces us to also sort strings deterministically, which is where Go's randomised map iteration would otherwise bite.

Where the test surface actually catches bugs

A pleasant surprise: most of the time the unit tests in any one language pass and only the cross-language script fails. That is diagnostic in itself — it almost always points at either:

  • a missing sort.Strings / sort.Slice in Go,
  • a static_cast<int> instead of static_cast<int64_t> in C++,
  • a splitmix64 constant such as 0x9E3779B97F4A7C15 stored in a C++ type narrower than 64 bits (the compiler warns about the truncation, but the warning is buried in a thousand-line build log).

The two frozen scenarios are deliberately sized:

  • Scenario A (--ops 500 --keys 32): small enough to debug by re-running with a smaller op count and diffing the intermediate snapshots.
  • Scenario B (--ops 2000 --keys 128): large enough to thrash the secondary index and the tombstone code path.

execution

Order of operations we actually used

  1. Pick the reference implementation. Rust first, because the type system catches the easiest mistakes (signed/unsigned, missing match arms) at compile time. Once 13 unit tests pass in Rust, freeze the golden hashes from the release build.
  2. Port to Go. Mirror the structure 1:1. The only language-shaped differences are: an explicit sort.Slice everywhere a Rust BTreeMap iteration is implicit, and fmt.Sprintf("t%d", n) in place of Rust's format!("t{}", n).
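  3. Port to C++. Same structure again. Use std::map instead of std::unordered_map so iteration is sorted-by-key for free. Format the tag as plain base-10: std::to_string on an integer is enough, and if you use std::ostringstream, imbue std::locale::classic() so a changed global locale cannot add digit grouping.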
  4. Write the cross-language script last. Build all three CLI binaries, run both scenarios, assert pairwise equality and equality to the goldens.

Running it

$ cd db-15-sqlite-complete
$ bash scripts/verify.sh
=== Rust ===
... test result: ok. 13 passed; 0 failed
=== Go ===
ok      github.com/10xdev/dse/db15      ...
=== C++ ===
OK 13 tests
=== OK ===

$ bash scripts/cross_test.sh
=== scenario A: --seed 42 --ops 500 --keys 32 ===
  rust=e8ccacd39d8535c1ed101f0bc8b7a0799f56468a384da9284d4768cd8b3a3aab
  go  =e8ccacd39d8535c1ed101f0bc8b7a0799f56468a384da9284d4768cd8b3a3aab
  cpp =e8ccacd39d8535c1ed101f0bc8b7a0799f56468a384da9284d4768cd8b3a3aab
=== scenario B: --seed 7 --ops 2000 --keys 128 ===
  rust=dd1d6bb7fec1ffc9f71f01e75a58166b04517a669495af2aa2da432d4722db69
  go  =dd1d6bb7fec1ffc9f71f01e75a58166b04517a669495af2aa2da432d4722db69
  cpp =dd1d6bb7fec1ffc9f71f01e75a58166b04517a669495af2aa2da432d4722db69
=== ALL OK ===

Things that went wrong in development

  • First Go run produced a different hash for scenario A. Cause: ranging directly over c.secondary instead of collecting keys and calling sort.Strings. The fix is in src/go/sql15.go; see DumpSnapshot.
  • First C++ run also diverged. Cause: std::unordered_map instead of std::map. Same fix shape — switch container, or sort keys before iteration. We chose std::map for symmetry with Rust's BTreeMap.
  • A test asserted SHA256("abc") and failed. Typo in the expected hex string (extra 3, missing trailing d). The canonical value is ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad. Worth pinning a known SHA-256 vector in every cross-language lab.

observation

What we measured

For each implementation, on every commit:

  • All unit tests pass under release optimisation. (Debug-only bugs are real — assert(side_effect) under -DNDEBUG is the classic.)
  • The CLI binary produces both golden hashes.
  • The cross-language script produces the same hash from all three binaries.

What the bytes look like

The snapshot from scenario A is 7088 bytes. Roughly:

  • 8 bytes magic
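  • 8 bytes next_txid (1 plus the number of ops that mutated state; reads and no-op UPDATE/DELETE calls do not advance it, so for scenario A it ends well below 501)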
  • 4 bytes primary row count (≈ 28 of 32 possible keys are touched)
  • Per row: 8 + 8 + 4 + len(tag) + 8 + 8 = 36 + len(tag) bytes
  • 4 bytes secondary distinct tag count
  • Per (tag, keys): 4 + len(tag) + 4 + 8*key_count bytes

The largest single section is the primary; the secondary is small because the tag alphabet is fixed at 16.

Visibility of tombstones

Because tombstoned rows stay in the primary, you can read the snapshot and recover the current visible state by filtering on deleted_at == 0. That property let us write a test that asserts the primary row count equals live_count + tombstone_count, which caught a regression where exec_delete was removing the row from the primary instead of marking it.

The shape of kind distribution

Across 2000 ops in scenario B, the empirical distribution of kind matches the design 3:2:1:1:1 ratio within ~3%, which suggests that the three bits of r1 that feed kind (bits 60 to 62) are uniform enough that we do not need a rejection sampler.

Non-determinism we did not observe

  • No flakes across 50 runs of scenario B.
  • No drift between debug and release builds for any implementation.
  • No drift between macOS arm64 and Linux x86_64 (sanity-checked once in a throwaway container).

All three of those properties are load-bearing for the cross-language test to be useful: if any of them fail, the script becomes a flaky test and people learn to ignore it.

verification

The verification ladder

  1. Unit tests inside each language. 13 tests per implementation, covering insert/update/delete semantics, the no-op rule on missing keys, secondary-index maintenance across tag changes, the tombstone-then-reinsert path, the wire format byte layout, and the two frozen scenarios.
  2. scripts/verify.sh runs all three suites end to end.
  3. scripts/cross_test.sh builds all three CLI binaries and asserts byte-identical SHA-256 across Rust/Go/C++ for both scenarios and equality with the frozen goldens.

What each layer protects against

| Layer | Catches |
| --- | --- |
| Unit tests | Wrong semantics within one language: e.g. UPDATE bumping txid when the row was missing. |
| Frozen golden in unit tests | Drift in one language only: e.g. someone "fixes" the splitmix64 constants. |
| Cross-language script | Cross-language drift: e.g. Go iterating a map without sorting. |
| Both goldens | Drift that happens to leave one scenario unchanged. Hitting two seeds at very different op counts is a cheap insurance policy. |

Test vectors we pinned

In every language:

  • SHA256("") = e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
  • SHA256("abc") = ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
  • splitmix64(0) = 0x8b57dafca0cee644

If any of those fail, every higher-level test is meaningless, so they run first.

How to debug a cross-language mismatch

If cross_test.sh reports a mismatch:

  1. Re-run with progressively smaller --ops (say 10) to find the smallest count at which the divergence still appears. A SHA-256 comparison only says equal or not equal, so you need to dump the actual bytes.
  2. Add a temporary print of the snapshot's hex before the SHA, in both the suspect language and the reference. Run xxd or od -An -tx1 on the two outputs and diff them.
  3. The first byte that differs almost always points at a section boundary. Decode the next_txid and primary row count by hand.
  4. The two most common causes by a wide margin are (a) unsorted map iteration and (b) a 64-bit value narrowed in C++ (a splitmix64 constant or a cast that ends up in a 32-bit type). Check those first.
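
Step 3 of the ladder can be mechanised with a small helper. The sketch below is hypothetical (first_diff and header_field are not names from the lab) and assumes the db-15 snapshot layout: 8 bytes of magic, 8 bytes of next_txid, then a 4-byte primary row count.

```rust
/// Offset of the first byte at which two snapshots differ, if any.
/// If one buffer is a strict prefix of the other, the shorter length is reported.
pub fn first_diff(a: &[u8], b: &[u8]) -> Option<usize> {
    let common = a.len().min(b.len());
    (0..common)
        .find(|&i| a[i] != b[i])
        .or(if a.len() != b.len() { Some(common) } else { None })
}

/// Name the fixed header field an offset falls into (db-15 snapshot layout).
pub fn header_field(offset: usize) -> &'static str {
    match offset {
        0..=7 => "magic",
        8..=15 => "next_txid",
        16..=19 => "primary_row_count",
        _ => "row or secondary data",
    }
}
```

A first difference inside next_txid usually means one implementation bumped on a no-op; a difference past the header with matching lengths is more likely an ordering problem.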

Coverage gaps we accept

We do not run a property-based test (no proptest in Rust, no testing/quick in Go). The two seeded scenarios are dense enough that we have not yet seen a real bug that they missed and a property-based test would have caught, and adding proptest would make the test loop slower and more flaky.

broader ideas

What a "real" SQLite slice would add

If the goal is fidelity rather than pedagogy, the natural next steps, roughly in order of payoff:

  • A pager backed by pwrite. This is db-11. Once you have a pager the snapshot becomes the file, not an ad-hoc byte stream.
  • Page-level checksums. Even just XXH3 per page; turns silent corruption into noisy corruption.
  • A WAL. Append-only journal of operations, replay on open. db-03 does the WAL; the fusion is in db-23.
  • Schema and DDL. CREATE TABLE, multiple tables, column types. The single-table assumption hides almost all the catalog complexity.
  • A query planner. Even just a cost-based decision between SELECT_BY_K and SELECT_BY_TAG would be educational. With one table and two indexes the planner is trivial; with joins it explodes.

What this engine could become with concurrency

The MVCC bookkeeping is already there — created_at and deleted_at. What is missing for real read-mostly concurrency:

  • A reader-visible snapshot timestamp, so SELECT reads consistently as of "the latest committed txid I saw".
  • Write-set tracking and a commit barrier, so two writers cannot both bump txid without serialising.
  • Garbage collection of tombstoned rows once no live reader could observe them. The current code holds tombstones forever, which is fine for a benchmark and disastrous for a real system.

The interesting thing is that the snapshot wire format would still work — you would just be dumping a consistent point rather than the literal in-memory state.

What this engine could never be

Without on-disk persistence, this is not a database; it is a test fixture. Adding the pager moves it to "embedded KV with SQL", which is roughly what SQLite is.

It will never be a server. Network protocols, connection management, client-side query plans, authentication — none of that is in scope for any lab in this series, by design.

Useful tangent: cross-language byte identity as a discipline

This lab is a microcosm of a discipline that pays off elsewhere:

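  • Protobuf (and gRPC on top of it) does not guarantee a canonical encoding, so anything that signs or hashes serialised messages has to pin a deterministic byte form first.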
  • Git's object hashing depends on a canonical layout per object type.
  • Bitcoin transactions are hashed with double SHA-256 over a canonical serialised form.

Whenever you find yourself asking "is this implementation correct", producing a canonical byte stream from each implementation and hashing it is one of the cheapest mechanical proofs you can buy.

step 01 — pager-and-rows

Goal

Build the in-memory primary container. By the end of this step you should have:

  • a Row { v, tag, created_at, deleted_at } value type,
  • a Conn with next_txid starting at 1 and an ordered primary: i64 -> Row map,
  • exec_insert (UPSERT) bumping next_txid every call,
  • select_by_k returning live rows only.

Why "pager-and-rows" not just "rows"

Even though we are not implementing a real on-disk pager here, the discipline of treating the primary map as the single source of truth for both visible and tombstoned state mirrors what a pager gives you: a flat, ordered store you can walk in key order.

If you wanted, you could later swap the in-memory std::map for a B-tree built on top of db-11's pager and not have to change anything else in this lab.

Tasks

  1. Define Row and Conn in your chosen language.
  2. Implement exec_insert with UPSERT semantics. Make sure that inserting at the same key twice replaces the row and uses the new txid in created_at.
  3. Implement select_by_k. It must filter out tombstoned rows.
  4. Write a unit test that inserts two keys, selects them back, and asserts next_txid == 3.

Pitfalls

  • If you use a hash map (Go map, C++ unordered_map), the wire test in step 03 will fail because iteration order will not be deterministic. Use an ordered map (BTreeMap, sorted-keys iteration, std::map).
  • Use i64 for k. i32 will silently truncate when the workload in step 03 mods a u64 by keys and casts.

step 02 — SQL surface and MVCC

Goal

Add exec_update, exec_delete, select_by_tag, and the secondary index. By the end of this step:

  • exec_update must be a no-op (and not bump next_txid) if the row is missing or tombstoned. If present, it keeps the original created_at and only mutates v and tag.
  • exec_delete must be a no-op if the row is missing or tombstoned. If present, it sets deleted_at = next_txid and removes the row from the secondary index. The row stays in the primary.
  • The secondary index tag -> sorted Vec<i64> is maintained on every mutating op. Only live rows are present.
  • select_by_tag(tag) returns the secondary list, or empty.

Tasks

  1. Wire exec_update and exec_delete with the no-op-on-missing rule. Test it by calling each on a key that does not exist and asserting next_txid did not move.
  2. Implement secondary insertion as sorted insert (binary search + shift, or BTreeMap::entry().or_default() + sorted insert).
  3. Implement secondary removal as sorted lookup + erase. If the list becomes empty, drop the tag entirely (otherwise the snapshot will carry empty entries and diverge from the spec).
  4. Add a test that inserts three rows with the same tag in scrambled key order, then asserts select_by_tag returns them in ascending order.
  5. Add a test for the resurrection path: insert, delete, insert again on the same key. The new row must have a fresh created_at and deleted_at == 0.
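
Tasks 2 and 3 can be sketched in Rust as two free functions over the secondary map. The names sec_insert, sec_remove and the Secondary alias are hypothetical; the approach (binary search, then insert or erase at the found position) is the one the tasks describe.

```rust
use std::collections::BTreeMap;

pub type Secondary = BTreeMap<String, Vec<i64>>;

/// Insert k under tag, keeping the list strictly ascending (duplicates are ignored).
pub fn sec_insert(sec: &mut Secondary, tag: &str, k: i64) {
    let list = sec.entry(tag.to_string()).or_default();
    if let Err(pos) = list.binary_search(&k) {
        list.insert(pos, k);
    }
}

/// Remove k from tag's list and drop the tag entirely once its list is empty,
/// so the snapshot never carries a zero-length entry.
pub fn sec_remove(sec: &mut Secondary, tag: &str, k: i64) {
    let emptied = match sec.get_mut(tag) {
        Some(list) => {
            if let Ok(pos) = list.binary_search(&k) {
                list.remove(pos);
            }
            list.is_empty()
        }
        None => false,
    };
    if emptied {
        sec.remove(tag);
    }
}
```

An UPDATE that changes the tag would call sec_remove with the old tag and sec_insert with the new one; a DELETE calls only sec_remove.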

Pitfalls

  • The most common bug is bumping next_txid in exec_update even on a no-op. Unit tests that never exercise a missing key will still pass in that language; the cross-language hash diverges after the first missing-key update.
  • Forgetting to drop an empty tag from secondary after the last delete will add a zero-length entry to the snapshot dump and break cross-language byte equality.
  • In C++, std::map::operator[] default-constructs missing entries silently — use find for reads and [] only when you intend the insert.

step 03 — cross-language snapshot

Goal

Produce the canonical snapshot byte stream defined in ../CONCEPTS.md, run the deterministic workload in each language, and assert byte-identical SHA-256 across Rust, Go, and C++.

By the end of this step:

  • dump_snapshot exists in every language and produces bytes that match the spec section-for-section.
  • A run_workload(seed, ops, keys, scenario) function exists in every language and is bit-exact.
  • The CLI prints the hex SHA-256 with no trailing newline.
  • scripts/verify.sh ends with === OK ===.
  • scripts/cross_test.sh ends with === ALL OK === and reports both golden hashes for scenarios A and B.

Tasks

  1. Implement dump_snapshot. Build it incrementally: write the magic + header first, get a single-row dump matching by hand, then add the secondary section.
  2. Implement splitmix64 and a stateful SplitMix64::next(). Pin a test for splitmix64(0) == 0x8b57dafca0cee644 to guard against constant typos.
  3. Implement run_workload per the rules in CONCEPTS.md. Pay special attention to: drawing all three rng words even for read ops; the kind decoding (r1 >> 60) & 0x7; the modulo casts to i64.
  4. Implement sha256_hex. In Rust use the sha2 crate. In Go use crypto/sha256 + encoding/hex. In C++ inline the reference implementation (FIPS 180-4) — keep it in the same translation unit as the engine to avoid a third-party dependency. Pin SHA256("") and SHA256("abc") in tests.
  5. Wire up the CLI: sqlitectl workload --seed N --ops N --keys N --scenario S. Print the hex with print! / fmt.Print / std::cout — no newline.
  6. Run scripts/verify.sh then scripts/cross_test.sh. Iterate until both end with their success markers.

Debugging a divergence

If cross_test.sh shows different hashes between languages, follow the ladder in ../docs/verification.md: shrink the op count, dump the raw snapshot bytes with xxd, diff, and look for the first differing byte. It almost always points at a section boundary that exposes either map-iteration order or a wrong-width cast.

Acceptance

  • All three unit suites pass under release optimisation.
  • Both === OK === and === ALL OK === markers appear.
  • Scenario A hash: e8ccacd39d8535c1ed101f0bc8b7a0799f56468a384da9284d4768cd8b3a3aab.
  • Scenario B hash: dd1d6bb7fec1ffc9f71f01e75a58166b04517a669495af2aa2da432d4722db69.

db-16 — Distributed-Fundamentals

This lab builds the vocabulary the rest of the distributed track (db-17 Raft, db-18 Paxos, db-19 ZAB, db-20 distributed-kv) will speak in: logical clocks, vector clocks, the happens-before relation, and a deterministic discrete-event simulator that produces a byte-identical event log across three independent implementations (Rust, Go, C++).

If you cannot write a simulator whose output is bit-stable across runs and across languages, you cannot run reproducible distributed-systems experiments. Every other lab in the track will reuse the discipline established here.


What is it?

A distributed system is a collection of nodes that exchange messages over an asynchronous, lossy network. Three primitives let us reason about such systems without having a wall-clock everyone agrees on:

  1. Lamport clock — a single integer per node that is incremented on every local event, stamped onto each outgoing message, and bumped to max(self, incoming) + 1 on receive. Lamport (1978) showed that this discipline produces timestamps consistent with causality: if event a happens-before event b, then ts(a) < ts(b). Breaking ties by node id extends the timestamps to a total order. The reverse implication does not hold: ts(a) < ts(b) does not imply that a happens-before b.

  2. Vector clock — one counter per node, packaged into a map. Local event increments the owner's counter; receive does pointwise max(self, incoming) then increments the owner's counter. The resulting partial order is the happens-before relation: two events are concurrent iff neither clock dominates the other.

  3. Deterministic discrete-event simulator — a single-threaded loop that drives sim time forward in integer ticks, delivering messages whose delivery_time == t before letting nodes act. With a seeded PRNG and canonical message ordering, the same (seed, nodes, rounds) triple must always produce the same event log — in any language.


Why does it matter?

  • Raft (db-17), Paxos (db-18), ZAB (db-19) all rely on causality: a leader can only commit an entry after it has been replicated to a quorum of followers. Vector clocks give us the language to prove that a particular log entry could not have been committed before a prerequisite was acknowledged.

  • Reproducibility is the difference between "I think my consensus algorithm is correct" and "I have an event log I can re-run on someone else's machine and get the same answer." When db-17 develops a leader-election bug under network partition, the first thing you reach for is a deterministic replay of the failure.

  • Three independent implementations force clarity. Any ambiguity in the spec ("when do you read the clock vs. increment it?") will show up as a byte diff in scripts/cross_test.sh. Pinning the wire format and the scheduling rule is the lab.


How does it work?

Lamport rule

local event :  self += 1
send        :  self += 1 ; stamp message with self
recv(m)     :  self = max(self, m.ts) + 1
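
The three rules map directly onto a small Rust type. This is a sketch with hypothetical names (LamportClock, tick, send, recv); each method returns the value after the local step, matching the "value AFTER the local step" convention used by the wire format.

```rust
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct LamportClock {
    pub time: u64,
}

impl LamportClock {
    /// Local event: advance by one.
    pub fn tick(&mut self) -> u64 {
        self.time += 1;
        self.time
    }

    /// Send: advance by one and return the stamp for the outgoing message.
    pub fn send(&mut self) -> u64 {
        self.time += 1;
        self.time
    }

    /// Receive: jump past both the local clock and the message stamp.
    pub fn recv(&mut self, msg_ts: u64) -> u64 {
        self.time = self.time.max(msg_ts) + 1;
        self.time
    }
}
```

With two fresh clocks a and b, a.tick() gives 1, a.send() gives 2, and b.recv(2) gives 3, so the receive is ordered after the send even though b had seen no events of its own.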

Vector-clock rule

local event(i)    :  vc[i] += 1
send(i)           :  vc[i] += 1 ; stamp message with snapshot of vc
recv(i, m)        :  for k in m.vc : vc[k] = max(vc[k], m.vc[k])
                     vc[i] += 1
partial order     :  vc_a < vc_b   iff (∀k) vc_a[k] ≤ vc_b[k]  AND  vc_a ≠ vc_b
                     vc_a || vc_b  iff neither <  nor  > nor =
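
A Rust sketch of the same rules, using a BTreeMap so that iteration is already in ascending node-id order. VectorClock and its method names are hypothetical, and treating a missing entry as counter 0 is an assumption of this sketch rather than something the rules above state.

```rust
use std::cmp::Ordering;
use std::collections::BTreeMap;

#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct VectorClock {
    pub counters: BTreeMap<u32, u64>,
}

impl VectorClock {
    /// Local event on node `owner`.
    pub fn tick(&mut self, owner: u32) {
        *self.counters.entry(owner).or_insert(0) += 1;
    }

    /// Send: tick, then return a snapshot to stamp onto the message.
    pub fn send(&mut self, owner: u32) -> VectorClock {
        self.tick(owner);
        self.clone()
    }

    /// Receive: pointwise max with the incoming stamp, then tick the owner.
    pub fn recv(&mut self, owner: u32, incoming: &VectorClock) {
        for (&node, &count) in &incoming.counters {
            let slot = self.counters.entry(node).or_insert(0);
            *slot = (*slot).max(count);
        }
        self.tick(owner);
    }

    /// Some(Less): self happened-before other. Some(Greater): the reverse.
    /// Some(Equal): identical counters. None: concurrent.
    pub fn compare(&self, other: &VectorClock) -> Option<Ordering> {
        let (mut le, mut ge) = (true, true);
        for node in self.counters.keys().chain(other.counters.keys()) {
            let a = self.counters.get(node).copied().unwrap_or(0);
            let b = other.counters.get(node).copied().unwrap_or(0);
            if a > b { le = false; }
            if a < b { ge = false; }
        }
        match (le, ge) {
            (true, true) => Some(Ordering::Equal),
            (true, false) => Some(Ordering::Less),
            (false, true) => Some(Ordering::Greater),
            (false, false) => None,
        }
    }
}
```

Returning Option<Ordering> keeps the "concurrent" case explicit instead of folding it into an arbitrary ordering.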

Simulator loop

for t in 0 .. rounds + MAX_DELAY:
    # 1. deliver — strict (delivery_time, sender_id, seq) order
    while heap is non-empty and heap.top().delivery_time == t:
        msg = heap.pop()
        node[msg.dest].recv(msg)
        emit Recv

    # 2. send — only during the active window
    if t < rounds:
        for s in 0 .. nodes:
            r       = splitmix64(seed ^ (t<<32) ^ (s+1))
            pre     = (r & 0xFFFF) % (nodes - 1)
            dest    = pre + 1 if pre >= s else pre    # skip self
            delay   = 1 + (((r>>16) & 0xFFFF) % 3)    # 1..3 ticks
            payload = (r>>32) & 0xFF
            node[s].send_to(dest, delay, payload)
            emit Send

The two phases (deliver-then-send) per tick, the strict heap ordering, and the splitmix64 PRNG together guarantee determinism.
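
The per-sender derivation in the send phase can be isolated as a pure Rust function of the PRNG word. SendPlan and plan_send are hypothetical names; the bit slicing follows the loop above, and the skip-self step follows the rule "if pre >= s then pre + 1". The sketch assumes nodes >= 2.

```rust
/// Fields of one scheduled send, derived from a single PRNG word.
#[derive(Debug, PartialEq, Eq)]
pub struct SendPlan {
    pub dest: u32,
    pub delay: u64,
    pub payload: u8,
}

/// `r` is the PRNG word for (tick, sender); `s` is the sender id;
/// `nodes` is the cluster size and must be at least 2.
pub fn plan_send(r: u64, s: u32, nodes: u32) -> SendPlan {
    // Pick among the other nodes - 1 slots, then shift past the sender.
    let pre = ((r & 0xFFFF) % u64::from(nodes - 1)) as u32;
    let dest = if pre >= s { pre + 1 } else { pre };
    let delay = 1 + ((r >> 16) & 0xFFFF) % 3; // 1..=3 ticks
    let payload = ((r >> 32) & 0xFF) as u8;
    SendPlan { dest, delay, payload }
}
```

Keeping this step free of state makes it easy to assert in every port that a node never addresses itself and that dest stays below nodes.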

Canonical wire format

file := magic[4="DSE6"] u32_le(event_count) event*

event :=
    u8  kind                  # 1 = Send, 2 = Recv
    u64_le sim_time
    u32_le node               # sender for Send, receiver for Recv
    u32_le peer               # dest for Send, source for Recv
    u64_le lamport            # value AFTER the local step
    u32_le vc_len
    [u32_le node, u64_le counter] * vc_len   # sorted ASC by node
    u32_le payload_len
    u8 payload[payload_len]

All multi-byte numbers are little-endian. Vector-clock entries must be serialized in ascending order by node-id; this is the single most common source of byte-diff bugs.
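
A Rust sketch of an encoder for this layout. Event, encode_event and encode_log are hypothetical names; the field order and widths are taken from the grammar above. The vector clock is sorted defensively before writing, in case the caller holds it in an unordered container.

```rust
pub struct Event {
    pub kind: u8, // 1 = Send, 2 = Recv
    pub sim_time: u64,
    pub node: u32,
    pub peer: u32,
    pub lamport: u64,
    pub vc: Vec<(u32, u64)>, // (node, counter); may arrive unsorted
    pub payload: Vec<u8>,
}

pub fn encode_event(ev: &Event, out: &mut Vec<u8>) {
    out.push(ev.kind);
    out.extend_from_slice(&ev.sim_time.to_le_bytes());
    out.extend_from_slice(&ev.node.to_le_bytes());
    out.extend_from_slice(&ev.peer.to_le_bytes());
    out.extend_from_slice(&ev.lamport.to_le_bytes());
    let mut vc = ev.vc.clone();
    vc.sort_by_key(|&(node, _)| node); // ascending node id on the wire
    out.extend_from_slice(&(vc.len() as u32).to_le_bytes());
    for (node, counter) in vc {
        out.extend_from_slice(&node.to_le_bytes());
        out.extend_from_slice(&counter.to_le_bytes());
    }
    out.extend_from_slice(&(ev.payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&ev.payload);
}

pub fn encode_log(events: &[Event]) -> Vec<u8> {
    let mut out = b"DSE6".to_vec();
    out.extend_from_slice(&(events.len() as u32).to_le_bytes());
    for ev in events {
        encode_event(ev, &mut out);
    }
    out
}
```

An event with v vector-clock entries and p payload bytes occupies 33 + 12v + p bytes, which gives a quick sanity check on total log length before hashing.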

Cross-language invariants

| Invariant | Why it matters |
| --- | --- |
| splitmix64 mix seed ^ (t<<32) ^ (s+1) | identical PRNG stream |
| dest skip-self: if pre >= s then pre+1 | identical destination choice |
| heap order (delivery_time, sender, seq) | identical delivery order |
| seq is global monotonic | deterministic tie-break across nodes |
| VC entries sorted by node-id on the wire | byte-identical serialization |
| all integers little-endian | byte-identical on every host |

If any one of these drifts, scripts/cross_test.sh will fail at the sha256 compare, and cmp -l will list the differing bytes, starting at the offset of the first divergence.
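
The heap-order invariant can be obtained in Rust from a derived Ord plus std's BinaryHeap wrapped in Reverse, as sketched below. Pending and due_at are hypothetical names, and the message body is reduced to the ordering fields plus dest for brevity.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// In-flight message. Derived Ord compares fields in declaration order,
/// which yields the (delivery_time, sender, seq) ordering.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct Pending {
    pub delivery_time: u64,
    pub sender: u32,
    pub seq: u64,
    pub dest: u32,
}

/// Pop every message due at tick `t`, smallest (delivery_time, sender, seq) first.
/// BinaryHeap is a max-heap, so entries are wrapped in Reverse.
pub fn due_at(heap: &mut BinaryHeap<Reverse<Pending>>, t: u64) -> Vec<Pending> {
    let mut out = Vec::new();
    while heap.peek().map_or(false, |Reverse(m)| m.delivery_time == t) {
        if let Some(Reverse(m)) = heap.pop() {
            out.push(m);
        }
    }
    out
}
```

Because seq is globally monotonic, no two entries share the same (delivery_time, sender, seq) triple, so the trailing dest field never decides the order.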


Files

  • src/rust/distfund16 crate + simctl binary.
  • src/go/ — module github.com/10xdev/dse/db16 + cmd/simctl.
  • src/cpp/db16_lib static library + simctl binary + test_db16.
  • scripts/verify.sh — runs the unit tests for all three.
  • scripts/cross_test.sh — proves the three binaries produce byte-identical event logs for two seeded scenarios.

See docs/ for the longer write-up, and steps/ for the staged implementation path.

db-16 — References

Primary sources

  • Leslie Lamport, Time, Clocks, and the Ordering of Events in a Distributed System, CACM 21(7), 1978. The original paper. Defines happens-before, the logical clock, and (in §4) the construction of a total order consistent with causality. https://lamport.azurewebsites.net/pubs/time-clocks.pdf
  • Colin Fidge, Timestamps in Message-Passing Systems That Preserve the Partial Ordering, 11th ACSC, 1988. Introduces vector clocks and proves they characterize the happens-before relation exactly.
  • Friedemann Mattern, Virtual Time and Global States of Distributed Systems, 1989. The companion vector-clock paper; reads more approachably than Fidge.
  • Sebastiano Vigna, splitmix64 — a small, fast, well-distributed 64-bit mixer used as the seeder for xoroshiro. https://prng.di.unimi.it/splitmix64.c

Determinism and reproducibility

  • Stefan Savage et al., Eraser: A Dynamic Data Race Detector for Multithreaded Programs, SOSP 1997. Not directly cited here, but the underlying motivation, that a concurrency bug you cannot reproduce is a bug you cannot debug, is the entire reason this lab exists.
  • FoundationDB's simulation testing (Apple/Snowflake) — a production example of deterministic discrete-event simulation at scale. https://apple.github.io/foundationdb/testing.html
  • Jepsen — Kyle Kingsbury's distributed-systems testing harness. Not deterministic itself (it injects real faults), but the methodology of "generate events, observe a history, check it against a model" is the vocabulary db-16 sets up. https://jepsen.io/

Production engines that use these primitives

  • Riak / Dynamo — vector clocks for sibling reconciliation.
  • CRDTs (Shapiro, Preguiça, Baquero, Zawirski, 2011) — vector clocks and version vectors are the substrate for state-based merge functions.
  • TLA+ — Lamport's specification language; ordering events by Lamport clock is the mental model behind every TLA+ refinement proof.

Cross-lab dependencies

  • This lab has no upstream dependencies. It is the bedrock for the distributed track.
  • db-17 Raft consumes the simulator: leader-election scenarios and log-replication invariants will be expressed as scripted event sequences run against a deterministic transport built on top of db-16.
  • db-18 Paxos, db-19 ZAB, db-20 distributed-kv reuse the same vocabulary (Lamport/VC for causality assertions, deterministic scheduler for fault-injection replay).

db-16 — Analysis

Required invariants

  1. Lamport monotonicity. For any node n, the sequence of Lamport values produced by its successive tick/send/recv calls is strictly increasing.
  2. Lamport consistency with happens-before. If a → b (happens-before), then ts(a) < ts(b). The converse does not hold; that is the cost of compressing a partial order into a single integer.
  3. Vector-clock characterization. With vector clocks the biconditional holds: a → b iff vc(a) ≤ vc(b) componentwise and vc(a) ≠ vc(b).
  4. Send-precedes-receive. Every Recv event in the simulator is paired with exactly one Send event from (peer → node) whose sim_time is strictly less than the Recv's sim_time and whose vector clock is strictly less than the Recv's.
  5. Byte determinism. For every (seed, nodes, rounds), the three binaries produce identical bytes on stdout. This is the single property scripts/cross_test.sh checks; if it ever drifts, all downstream labs lose reproducibility.

Design decisions

  • Two-phase tick (deliver-then-send). Each integer tick first drains all in-flight messages whose delivery_time has arrived, then runs every node's send. Doing deliver first means a single tick can witness a message being received and a response being sent — capturing causal flow without needing finer time resolution.

  • Heap ordered by (delivery_time, sender, seq). The third field (seq, a global monotonic counter) gives an unconditional tie-break even when two messages share both delivery_time and sender, which happens when one node sends at different ticks with delays that land on the same tick.

  • splitmix64 seeded per (seed, t, s). A single splitmix64 call produces all three random fields (dest, delay, payload) for one (t, s) decision. This avoids the question "whose RNG state advances first" — there is no shared RNG state at all.

  • Vector-clock entries sorted on the wire. BTreeMap in Rust, sorted-key iteration in Go, std::map in C++ all produce ascending order naturally. If you ever switch the Rust side to HashMap you will get byte diffs.

  • Lib + thin CLI. All three implementations expose the same trio of primitives (Lamport, VectorClock, simulate/Simulate) as a library. The CLI is about ten lines: it calls serialize(simulate(...)) and writes the result to stdout. Downstream labs will link the library, not shell out to the CLI.
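The heap tie-break described above can be sketched in a few lines of Rust. This is a minimal illustration, not the lab's code: `delivery_order` is a hypothetical helper, and it assumes `seq` is assigned from one global counter at enqueue time. Tuples compare lexicographically, so wrapping the key in `Reverse` turns the standard max-heap into the required min-heap.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Heap key: (delivery_time, sender, seq). Tuples compare field by field.
type Key = (u64, u32, u64);

/// Hypothetical helper: enqueue (delivery_time, sender) pairs in the given
/// order, assigning a global seq at enqueue time, then pop in delivery order.
pub fn delivery_order(msgs: &[(u64, u32)]) -> Vec<Key> {
    let mut heap: BinaryHeap<Reverse<Key>> = BinaryHeap::new();
    let mut seq: u64 = 0;
    for &(delivery_time, sender) in msgs {
        heap.push(Reverse((delivery_time, sender, seq)));
        seq += 1; // one global counter, never reset
    }
    let mut out = Vec::with_capacity(heap.len());
    while let Some(Reverse(key)) = heap.pop() {
        out.push(key);
    }
    out
}
```

Two messages from sender 2 that both land on tick 5 come out in enqueue order, because `seq` is the only field that still differs.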

Why three languages

  • Forces the spec to be unambiguous. A Rust BTreeMap and a C++ std::map both happen to iterate in key order; the moment you reach for Go's map you discover the language does not and you must sort explicitly. That kind of discovery only happens with multiple implementations.
  • Pins endianness, integer overflow semantics (wrapping), and signed-vs-unsigned modulo. Splitmix64 in particular depends on unsigned wrapping multiplication; expressing it identically in three languages is a forcing function.
  • Future-proofs the track. db-17 onwards will pick one host language per experiment; having a reference implementation in three independent languages means a port bug in db-17's Raft simulator can be cross-checked against the db-16 baseline.

Tradeoffs worth flagging

  • Sim time is integer ticks, not floating-point seconds. This trades realism for determinism. Real networks have continuous-time jitter; capturing that would require an event-priority structure keyed by a rational/decimal time, which is not worth the complexity for a study lab.
  • All sends are unicast and always succeed. We do not model drops, reorderings beyond delay-based interleaving, or partitions. db-17 will add a partition primitive on top of this simulator; doing it here would mean adding --drop-rate to the CLI and changing the wire format, which would lock in a poor abstraction.
  • Each node sends exactly one message per tick during the active window. That is a fixed-load workload. Variable-load (silent nodes, bursty senders) would be a strict extension; it is intentionally omitted to keep the spec small enough to verify by hand.

db-16 — Execution

One-shot: prove the lab works

cd db-16-distributed-fundamentals
./scripts/verify.sh        # all unit tests in Rust, Go, C++
./scripts/cross_test.sh    # byte-identical event logs across all three

A green run of cross_test.sh ends with the literal line:

=== ALL OK ===

Per-language workflows

Rust

cd src/rust
cargo test                # 7 tests
cargo build --release     # produces target/release/simctl
./target/release/simctl --seed 42 --nodes 3 --rounds 20 > /tmp/log_rust.bin

Go

cd src/go
go test ./...             # 7 tests
go build -o /tmp/simctl_go ./cmd/simctl
/tmp/simctl_go --seed 42 --nodes 3 --rounds 20 > /tmp/log_go.bin

C++

cd src/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
ctest --test-dir build   # test_db16 → "db-16 C++ tests: 7 passed"
./build/simctl --seed 42 --nodes 3 --rounds 20 > /tmp/log_cpp.bin

CLI

All three binaries accept the same flags:

| flag | default | meaning |
| --- | --- | --- |
| --seed N | 0 | splitmix64 seed |
| --nodes K | 3 | number of nodes; must be ≥ 2 |
| --rounds R | 20 | number of send-rounds (sim time runs for R + MAX_DELAY ticks) |

The output is the binary wire format described in CONCEPTS.md. Pipe to a file; do not display on a terminal.

Canonical scenarios

scripts/cross_test.sh runs two scenarios; their sha256s are checked into the lab's verification path:

| label | args | sha256 |
| --- | --- | --- |
| A | --seed 42 --nodes 3 --rounds 20 | 0d7e753cdc891e3a481977da372a4d97a6a0e0ab00b74f5a074dbc25791dc797 |
| B | --seed 7 --nodes 5 --rounds 50 | 321221187709684afd59c55202f8d373dad33c8026e933b36740aeed23c8c2d4 |

If you change any of: PRNG, scheduler order, wire format, or VC entry ordering — these hashes will change and you must update both the script and this table in the same commit. That synchronization step is the forcing function that keeps the spec honest.

Sanity checks

# magic bytes
./target/release/simctl --seed 42 --nodes 3 --rounds 20 | xxd -l 8
# expect:  00000000: 4453 4536 7800 0000  DSE6x...
# 0x78 = 120 = events: 60 Sends + 60 Recvs for nodes=3 rounds=20

# event count
./target/release/simctl --seed 42 --nodes 3 --rounds 20 | \
  python3 -c 'import sys,struct; d=sys.stdin.buffer.read(); print(struct.unpack("<I", d[4:8])[0])'
# → 120

db-16 — Observation

What does the simulator's output actually look like, and how do you read it by hand?

offset 0x00 :  44 53 45 36                 "DSE6"   (magic)
offset 0x04 :  78 00 00 00                 120      (event_count, u32 LE)

For --seed 42 --nodes 3 --rounds 20 the event count is 3 nodes × 20 rounds × 2 (send + recv) = 120.
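The same header check can be done programmatically. The sketch below assumes only the 8-byte header layout shown above; `parse_header` and `expected_events` are illustrative names, not part of the lab's library.

```rust
/// Parse the 8-byte header: magic "DSE6" followed by a u32 LE event count.
/// Returns None if the buffer is too short or the magic does not match.
pub fn parse_header(bytes: &[u8]) -> Option<u32> {
    if bytes.len() < 8 || &bytes[..4] != b"DSE6" {
        return None;
    }
    Some(u32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]))
}

/// Every send produces exactly one recv, so the count is 2 * nodes * rounds.
pub fn expected_events(nodes: u32, rounds: u32) -> u32 {
    2 * nodes * rounds
}
```

Comparing `parse_header(&log)` against `Some(expected_events(3, 20))` reproduces the spot-check for scenario A.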

A single Send event

Every Send is the start of a causal arc; every Recv is its endpoint. The first event in scenario A is a Send from node 0 at sim_time 0:

01                       kind = 1 = Send
00 00 00 00 00 00 00 00  sim_time   = 0
00 00 00 00              node       = 0    (sender)
?? 00 00 00              peer       = ?    (destination, computed from PRNG)
01 00 00 00 00 00 00 00  lamport    = 1    (Send rule: self += 1, then stamp)
01 00 00 00              vc_len     = 1
00 00 00 00 01 00 00 00 00 00 00 00   (node=0, counter=1)
01 00 00 00              payload_len = 1
??                       payload byte

Note the vector clock for a node that has only sent has a single entry (its own counter). Receivers' vector clocks grow as they merge incoming clocks.
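To make the field widths concrete, here is a hedged Rust sketch that encodes one event in the layout annotated above. The `Event` struct and `encode_event` are assumptions made for illustration; the authoritative format is the one in CONCEPTS.md. A Send with a single VC entry and a one-byte payload comes to 1 + 8 + 4 + 4 + 8 + 4 + 12 + 4 + 1 = 46 bytes.

```rust
use std::collections::BTreeMap;

/// Illustrative event type; field order follows the annotated layout.
pub struct Event {
    pub kind: u8, // 1 = Send, 2 = Recv
    pub sim_time: u64,
    pub node: u32,
    pub peer: u32,
    pub lamport: u64,
    pub vc: BTreeMap<u32, u64>, // BTreeMap iterates in ascending key order
    pub payload: Vec<u8>,
}

/// Serialize one event with little-endian integers and sorted VC entries.
pub fn encode_event(e: &Event) -> Vec<u8> {
    let mut out = Vec::new();
    out.push(e.kind);
    out.extend_from_slice(&e.sim_time.to_le_bytes());
    out.extend_from_slice(&e.node.to_le_bytes());
    out.extend_from_slice(&e.peer.to_le_bytes());
    out.extend_from_slice(&e.lamport.to_le_bytes());
    out.extend_from_slice(&(e.vc.len() as u32).to_le_bytes());
    for (node_id, counter) in &e.vc {
        out.extend_from_slice(&node_id.to_le_bytes()); // u32 LE
        out.extend_from_slice(&counter.to_le_bytes()); // u64 LE
    }
    out.extend_from_slice(&(e.payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&e.payload);
    out
}
```

Because the VC lives in a `BTreeMap`, insertion order never leaks into the bytes.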

A single Recv event

Recvs look identical except kind = 2 and peer is the source node:

02                       kind = 2 = Recv
?? ?? ?? ?? ?? ?? ?? ??  sim_time   = original send time + delay
01 00 00 00              node       = 1   (receiver)
00 00 00 00              peer       = 0   (sender of paired Send)
?? ?? ?? ?? ?? ?? ?? ??  lamport    = max(self_before, incoming) + 1
02 00 00 00              vc_len     = 2
00 00 00 00 01 00 00 00 00 00 00 00   merged entry for node 0
01 00 00 00 ?? 00 00 00 00 00 00 00   own counter, incremented
01 00 00 00              payload_len = 1
??                       payload byte (copied from send)

The number of VC entries grows as a node hears from new peers; in a 3-node, 20-round run each receiver will eventually have all 3 entries.

Hex walkthrough

./simctl --seed 42 --nodes 3 --rounds 20 | xxd | head

Read column-by-column:

00000000: 4453 4536 7800 0000  DSE6x...                    header
00000008: 01 00 00 00 00 00 00 00 00                      first Send: kind=1, sim_time=0
00000011: 00 00 00 00                                     node=0
00000015: ?? 00 00 00                                     peer
00000019: 01 00 00 00 00 00 00 00                         lamport=1
00000021: 01 00 00 00                                     vc_len=1
00000025: 00 00 00 00 01 00 00 00 00 00 00 00             vc entry (0 → 1)
00000031: 01 00 00 00                                     payload_len=1
00000035: ??                                              payload byte
00000036: 01 ...                                          next event (probably another Send at t=0)

The whole file for scenario A is 8156 bytes; scenario B is 45592 bytes.

What to learn from looking at it

  • Lamport values are strictly increasing within a node, but reading the log top to bottom they may drop when the next event belongs to a different node. That is healthy: nodes 0 and 1 can be ahead of node 2 if 2 hasn't sent or received yet.
  • The vector-clock entry for node i in node i's own events is strictly monotonic.
  • For any Send/Recv pair, the Recv's VC must dominate the Send's VC (> in VcOrd). This is exactly what check_causality asserts.
  • If you sort all events by sim_time you get a globally consistent "tape" — but events at the same sim_time are concurrent and have no inherent ordering between nodes. Deliveries are scheduled before sends within a tick by simulator policy, not by physics.

Cross-language reading

scripts/cross_test.sh prints the hex of the first 8 bytes (4453 4536 7800 0000 for scenario A). If three implementations agree on those 8 bytes but disagree on the rest, the suspect is almost always either (a) VC-entry order on the wire, or (b) heap tie-break by sender id.

db-16 — Verification

How to reproduce the green status on a clean machine.

Prerequisites

  • macOS or Linux with Apple Clang / clang ≥ 14 / gcc ≥ 11.
  • cmake ≥ 3.20.
  • Rust toolchain ≥ 1.74.
  • Go ≥ 1.22.
  • shasum, xxd, awk (default on macOS; coreutils on Linux).

One command

cd db-16-distributed-fundamentals
scripts/verify.sh        # builds + unit tests in all three langs
scripts/cross_test.sh    # cross-language sha256 match

verify.sh should end with === OK === and cross_test.sh with === ALL OK ===; both exit 0.

Per-language drill-down

Rust

cd db-16-distributed-fundamentals/src/rust
cargo test --quiet
cargo build --release

Expected: 7 passed; 0 failed. The simctl binary lands in target/release/simctl.

Go

cd db-16-distributed-fundamentals/src/go
go test ./...
go build ./cmd/simctl

Expected: ok github.com/10xdev/dse/db16 <duration>.

C++

cd db-16-distributed-fundamentals/src/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
ctest --test-dir build --output-on-failure

Expected: 100% tests passed, 0 tests failed out of 1 and test_db16 prints "db-16 C++ tests: 7 passed".

What "green" means

A green run guarantees:

  • All 21 unit tests pass (7 each in Rust, Go, C++) covering Lamport monotonicity, vector-clock partial order including the Concurrent case, simulator determinism on a fixed seed, and causality of the generated event log.

  • The cross-language test produces byte-identical event logs for both canonical scenarios:

    | scenario | sha256 | size |
    | --- | --- | --- |
    | A --seed 42 --nodes 3 --rounds 20 | 0d7e753cdc891e3a481977da372a4d97a6a0e0ab00b74f5a074dbc25791dc797 | 8 156 B |
    | B --seed 7 --nodes 5 --rounds 50 | 321221187709684afd59c55202f8d373dad33c8026e933b36740aeed23c8c2d4 | 45 592 B |

    Matching sha256s prove that all three implementations agree on the PRNG, the scheduling rule, the Lamport / vector-clock update rules, the VC entry ordering on the wire, and the integer endianness.

  • The spot-check in cross_test.sh confirms the magic header 44 53 45 36 and the expected u32 LE event count, guarding against the regression where all three implementations agree on producing empty output.

When verification fails

  • Cross-language sha256 mismatch on the first 8 bytes — magic / count drift. Almost always a count formula bug (2 × nodes × rounds).
  • Mismatch past byte 8 but matching on a smaller --rounds — the PRNG or the scheduler diverges as soon as a recv-in-flight overlaps with a send. Inspect splitmix64 and the heap tie-break.
  • Causality test fails in one language only — that language's recv does not bump its own counter, or bumps before the merge. Read the Vector-clock rule in CONCEPTS again.
  • One language passes locally but the cross-test diverges — most often: VC entries serialized in insertion order rather than sorted by node-id. Switch to BTreeMap / std::map / explicit sort.Slice.

db-16 — Broader Ideas

Where the primitives in this lab show up in real systems, and what to build on top of them in the rest of the distributed track.

Immediate next labs

  • db-17 — Raft. Reuses the deterministic simulator wholesale. Adds a Role ∈ {Follower, Candidate, Leader}, election timeouts, AppendEntries RPCs, and a commit index. Every step's safety argument ultimately reduces to "this state could not have been reached without a quorum acknowledgement", which is a happens-before statement on the log — exactly what vector clocks make precise.
  • db-18 — Paxos. Same harness, different message types (Prepare/Promise/Accept/Accepted). Paxos's invariants are notoriously hard to reason about by hand; a deterministic simulator that can replay a counterexample seed is the difference between "I think it's correct" and "I have evidence".
  • db-19 — ZAB. Adds a strict total order on broadcasts and a recovery phase. The Lamport clock generalizes to the ZAB epoch + counter pair.
  • db-20 — Distributed KV. Wraps a quorum-replicated key-value store around a chosen consensus engine. Now the simulator's "payload" is a client command, and the event log is auditable per-replica state.

How this lab's pieces map to real systems

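  • Lamport clocks are the conceptual ancestor of several production ordering schemes, although none of them is a textbook Lamport clock: Kafka offsets are per-partition monotonic sequence numbers, Spanner's TrueTime adds a real-time uncertainty interval around a Lamport-like scalar, and Cassandra's per-cell write timestamps are scalars used for last-write-wins resolution.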
  • Vector clocks are the kernel of Amazon Dynamo's conflict detection, Riak's siblings, and the CRDT literature's "stable causal delivery" layer.
  • Deterministic discrete-event simulation is how FoundationDB developed and continues to harden its storage and replication code (Will Wilson's Testing Distributed Systems with Deterministic Simulation talk at Strange Loop 2014 is the canonical reference). It is also how TigerBeetle, Polar Signals, and Antithesis test their production code paths.
  • The (time, sender, seq) heap tie-break is the same trick used by every event-loop sim from simpy to game-engine fixed-timestep loops.

Performance experiments worth running later

  • Crank --nodes and --rounds and plot wall time vs. event count for each language. With the current canonical serializer this should be linear in events; any quadratic growth means the wire format or the heap is doing something dumb.
  • Replace the unicast splitmix64 destination with a broadcast and measure the explosion in VC entries per receive (each broadcast forces every other node's VC to grow by one entry).
  • Try a HashMap-based VC in Rust and observe cross_test.sh failing. This is the cheapest possible lesson on why deterministic iteration order matters; do it once and you will never forget it.

What "production-quality" would require beyond this lab

  • A real network layer (TCP or QUIC), with retries, timeouts, and application-level acks rather than the simulator's deliver-and-forget.
  • Lossy / reordering channels and partition primitives. db-17 will add partitions as a Network::partition(a, b) toggle; this lab deliberately omits them so the determinism story is small.
  • Persistent storage for clocks (so a crash-restart doesn't replay Lamport from 0). The Raft lab in db-17 will need this; the WAL we built in db-03 is the obvious substrate.
  • Compact vector clocks (interval tree clocks, dotted version vectors) for systems with thousands of nodes or more; the naive map-based VC here becomes a bandwidth problem at that scale.

None of these change the shape of the primitives — they make the same primitives faster, leaner, and tolerant of real-world failures.

db-16 step 01 — Logical clocks

Goal

Implement Lamport and vector clocks as first-class types in all three languages, with identical semantics under a small, well-defined API.

Tasks

  1. Lamport clock. A wrapper over a u64 counter exposing:
    • tick(): bump the counter, return the new value.
    • send() -> u64: equivalent to tick; bump, then return the stamp for the outgoing message.
    • recv(incoming: u64): self = max(self, incoming) + 1.
    • value() -> u64: return the current counter without changing it.
  2. Vector clock. A wrapper over Map<u32, u64> exposing:
    • tick(self_id): vc[self_id] += 1.
    • send(self_id) -> Self: bump own counter, return a clone of the full VC (the snapshot that gets stamped onto a message).
    • recv(self_id, incoming: &Self): pointwise vc[k] = max(vc[k], incoming[k]) for every key k in incoming, then bump the receiver's own counter.
    • partial_cmp(other) -> {Less, Equal, Greater, Concurrent}: a pure function over the two maps.
  3. Wire serialization for the VC. Entries on the wire MUST be sorted ascending by node_id. This is non-negotiable — it is the single biggest source of byte-diff bugs across languages.
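The API above is small enough to sketch in full. The following Rust is one possible shape under the stated rules, not the lab's reference solution; the tuple-struct representation and the `Default` derives are assumptions. `BTreeMap` is used so that iteration (and therefore any later serialization) is ascending by node id.

```rust
use std::collections::BTreeMap;

#[derive(Default)]
pub struct Lamport(u64);

impl Lamport {
    pub fn tick(&mut self) -> u64 {
        self.0 += 1;
        self.0
    }
    /// Same as tick: bump, then use the new value as the message stamp.
    pub fn send(&mut self) -> u64 {
        self.tick()
    }
    pub fn recv(&mut self, incoming: u64) {
        self.0 = self.0.max(incoming) + 1;
    }
    pub fn value(&self) -> u64 {
        self.0
    }
}

#[derive(Clone, Default, Debug, PartialEq)]
pub struct VectorClock(pub BTreeMap<u32, u64>);

impl VectorClock {
    pub fn tick(&mut self, self_id: u32) {
        *self.0.entry(self_id).or_insert(0) += 1;
    }
    /// Bump own counter, then return the snapshot stamped onto the message.
    pub fn send(&mut self, self_id: u32) -> VectorClock {
        self.tick(self_id);
        self.clone()
    }
    /// Pointwise max over the incoming keys first, own bump second.
    pub fn recv(&mut self, self_id: u32, incoming: &VectorClock) {
        for (&k, &v) in &incoming.0 {
            let slot = self.0.entry(k).or_insert(0);
            *slot = (*slot).max(v);
        }
        self.tick(self_id);
    }
}
```

Merging before the bump means the receive event ends up strictly greater than both the receiver's previous state and the incoming stamp.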

Acceptance

Inline unit tests in each language:

  • lamport_tick_monotonic: three ticks produce 1, 2, 3.
  • lamport_recv_jumps: recv(10) after value 3 yields 11.
  • vc_partial_order_less: {0:1} < {0:1, 1:1}.
  • vc_partial_order_concurrent: {0:2, 1:0} and {0:0, 1:2} are concurrent.
  • vc_recv_merges_then_ticks: recv(self=1, {0:5, 1:0}) from initial {1:2} yields {0:5, 1:3}.
  • vc_serialize_sorted: the bytes are identical no matter what order entries were inserted in the map.

All six green in Rust, Go, and C++.
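The two partial-order cases in the acceptance list can be checked against a small comparison function. This is a sketch that assumes a missing key counts as 0; `vc_partial_cmp` is a free function here purely for brevity, and the `VcOrd` variants mirror the four outcomes named in the task list.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum VcOrd {
    Less,
    Equal,
    Greater,
    Concurrent,
}

/// Compare two vector clocks componentwise; a missing key counts as 0.
pub fn vc_partial_cmp(a: &BTreeMap<u32, u64>, b: &BTreeMap<u32, u64>) -> VcOrd {
    let mut a_below = false; // some component of a is smaller than in b
    let mut b_below = false; // some component of b is smaller than in a
    for k in a.keys().chain(b.keys()) {
        let x = a.get(k).copied().unwrap_or(0);
        let y = b.get(k).copied().unwrap_or(0);
        if x < y {
            a_below = true;
        }
        if x > y {
            b_below = true;
        }
    }
    match (a_below, b_below) {
        (false, false) => VcOrd::Equal,
        (true, false) => VcOrd::Less,
        (false, true) => VcOrd::Greater,
        (true, true) => VcOrd::Concurrent,
    }
}
```

Collapsing `Concurrent` into `Equal` or `Less` would make the function claim a causal relation that the clocks do not support.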

Discussion prompts

  • Why does recv bump the receiver's own counter after the merge rather than before?
  • Why is the "Concurrent" outcome of partial_cmp necessary; what goes wrong if you collapse it into Equal or Less?
  • For a system with one million nodes, is a map-keyed VC still practical? What data structures replace it (hint: interval tree clocks, dotted version vectors)?

db-16 step 02 — Deterministic simulator

Goal

Build a discrete-event simulator whose (seed, nodes, rounds) triple completely determines its event log, and produce a canonical serialization of that log.

Tasks

  1. PRNG. Implement splitmix64 in each language with unsigned wrapping multiplication. Seed it per-decision with seed ^ (t << 32) ^ (s + 1) so that no shared mutable PRNG state crosses a (t, s) boundary. This eliminates the "whose turn is it to read the RNG?" ambiguity that bites every multi-language implementation.

  2. Per-tick decision. For each (t < rounds, s ∈ 0..nodes), let r be the single splitmix64 output for that (t, s) seed, then compute:

    • dest_pre = (r & 0xFFFF) % (nodes - 1) then skip-self: dest = dest_pre + (1 if dest_pre >= s else 0).
    • delay = 1 + ((r >> 16) & 0xFFFF) % 3.
    • payload = (r >> 32) & 0xFF.
  3. Scheduler. Maintain a min-heap of in-flight messages keyed on (delivery_time, sender_id, global_seq). global_seq is a single monotonic counter incremented every time a message is enqueued. This guarantees a total order even when two messages have identical (delivery_time, sender_id).

  4. Tick loop. For t in 0 .. rounds + MAX_DELAY:

    1. Drain all heap entries with delivery_time == t: for each, run recv on the destination node, emit a Recv event.
    2. If t < rounds: for each s in 0..nodes, compute the decision, run send on the sender to obtain the clock stamps, enqueue the message carrying them, emit a Send event.
  5. Wire format. As documented in CONCEPTS.md. Magic "DSE6", u32 LE event count, then event_count events. Each event uses little-endian integers and serializes its vector clock with entries sorted ascending by node id.

Acceptance

Inline unit tests:

  • splitmix64_known_values: for seed=0, the first three outputs are 0xE220A8397B1DCDAF, 0x6E789E6AA1B965F4, 0x06C45D188009454F.
  • sim_deterministic_one_node: --nodes 2 --rounds 3 --seed 1 produces a fixed event count and a fixed first-event byte sequence.
  • sim_event_count_formula: for any (nodes ≥ 2, rounds ≥ 1), total events = 2 * nodes * rounds (every send becomes exactly one recv).
  • causality_holds: after running simulate(...), walk the event log: every Recv from peer has a strictly-greater VC than the paired Send.
  • byte_round_trip: serializing the same event log twice yields identical bytes (no nondeterminism in the serializer itself).

All five green in Rust, Go, and C++.
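The event-count formula follows directly from the tick loop, which a stripped-down scheduler can demonstrate. The sketch below omits clocks and serialization and only counts events; `count_events` is a hypothetical name, MAX_DELAY = 3 is inferred from the delay formula, and the one-shot `splitmix64(x)` form assumes r is the first output for initial state x.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

const MAX_DELAY: u64 = 3; // delay = 1 + (.. % 3), so at most 3 ticks

/// One-shot splitmix64: first output for initial state x (assumption).
fn splitmix64(x: u64) -> u64 {
    let mut z = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E7B5);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Run the deliver-then-send loop and return (sends, recvs).
pub fn count_events(seed: u64, nodes: u32, rounds: u64) -> (u64, u64) {
    // Heap entries: (delivery_time, sender, seq, dest).
    let mut heap: BinaryHeap<Reverse<(u64, u32, u64, u32)>> = BinaryHeap::new();
    let (mut seq, mut sends, mut recvs) = (0u64, 0u64, 0u64);
    for t in 0..rounds + MAX_DELAY {
        // Phase 1: deliver everything due at tick t.
        while let Some(&Reverse((due, _, _, _))) = heap.peek() {
            if due != t {
                break;
            }
            heap.pop();
            recvs += 1; // a full simulator would run recv on dest here
        }
        // Phase 2: every node sends once during the active window.
        if t < rounds {
            for s in 0..nodes {
                let r = splitmix64(seed ^ (t << 32) ^ (s as u64 + 1));
                let pre = ((r & 0xFFFF) % (nodes as u64 - 1)) as u32;
                let dest = if pre >= s { pre + 1 } else { pre };
                let delay = 1 + ((r >> 16) & 0xFFFF) % 3;
                heap.push(Reverse((t + delay, s, seq, dest)));
                seq += 1;
                sends += 1;
            }
        }
    }
    (sends, recvs)
}
```

Running the loop for `rounds + MAX_DELAY` ticks is what guarantees the heap is empty at the end: the last send happens at `rounds - 1` and arrives no later than `rounds + 2`.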

Discussion prompts

  • Why deliver before send within a single tick?
  • What breaks if global_seq is per-sender instead of global?
  • The simulator never drops or reorders messages beyond delay-based interleaving. What new wire-format field would --drop-rate p need, and would it break the cross-language hash if defaulted to 0?

db-16 step 03 — CLI and cross-language byte-identity

Goal

Build a simctl CLI in all three languages, then prove via sha256 that all three produce byte-identical event logs for the same (seed, nodes, rounds) triple — for at least two distinct scenarios.

CLI contract

simctl --seed N --nodes K --rounds R

Writes the canonical wire-format bytes (no trailing newline) to stdout.

Tasks

  1. Build simctl in Rust (src/rust/src/bin/simctl.rs), Go (src/go/cmd/simctl/main.go), and C++ (src/cpp/src/simctl.cc).
  2. Write scripts/verify.sh that runs unit tests in all three langs.
  3. Write scripts/cross_test.sh that:
    1. Builds all three binaries.
    2. Scenario A: simctl --seed 42 --nodes 3 --rounds 20 → sha256 all three outputs → assert all three match.
    3. Scenario B: simctl --seed 7 --nodes 5 --rounds 50 → sha256 all three → assert all three match.
    4. Spot-check the first 8 bytes of scenario A's output equal the magic "DSE6" plus the u32 LE count 120.
    5. Print === ALL OK ===.

Acceptance

$ scripts/verify.sh
=== rust === ... ok
=== go   === ... ok
=== cpp  === ... ok
=== OK ===

$ scripts/cross_test.sh
...
  match(A): 0d7e753cdc891e3a481977da372a4d97a6a0e0ab00b74f5a074dbc25791dc797
  match(B): 321221187709684afd59c55202f8d373dad33c8026e933b36740aeed23c8c2d4
=== ALL OK ===

A byte-identical hash across three independent implementations is a near-proof that the PRNG, scheduler, clock-update rules, and wire format are all spec-compliant. Any divergence — even on a single byte — will surface here.

Discussion prompts

  • Why two scenarios instead of one? What property would slip through with a single scenario that two catch?
  • If the scenario-A hash matches but scenario B does not, where in the codebase would you start looking?
  • The sha256 hashes are baked into the script as constants. What's the benefit, and what's the maintenance cost when the wire format legitimately evolves (e.g., adding a new event kind)?

db-17 — Raft

This lab implements Raft consensus in Rust, Go, and C++, all three producing a byte-identical sha256 of a canonical cluster dump for any (seed, nodes, rounds, proposals, partition) configuration. It builds directly on the deterministic-simulator discipline from db-16: same splitmix64 seeding, same (delivery_time, sender, seq) heap tie-break, same "sorted iteration on the wire" rule.

If db-16 taught you to keep an event log bit-stable across three languages, db-17 teaches you to keep an entire replicated state machine's persistent state bit-stable across three languages and across network partitions. Every later distributed lab (db-18 Paxos, db-19 ZAB, db-20 distributed-kv) is a variation on this skeleton.


What is it?

Raft (Ongaro & Ousterhout, USENIX ATC 2014) is a consensus algorithm that keeps an ordered, append-only replicated log consistent across a cluster of nodes despite crashes, message reorderings, and arbitrary partitions of the network. It is the consensus core inside etcd, Consul, TiKV, CockroachDB, and many more; MongoDB's replication protocol is modeled on it as well.

Raft decomposes consensus into three sub-problems:

  1. Leader election. Each node is one of {Follower, Candidate, Leader}. Followers run an election timeout; on timeout a follower becomes a candidate, bumps its current_term, votes for itself, and broadcasts RequestVote. A candidate that receives a majority of vote_granted=true replies in the same term becomes leader.

  2. Log replication. The leader accepts client proposals and appends them to its log. It broadcasts AppendEntries RPCs carrying the new entries plus a prev_log_index / prev_log_term consistency check. On a mismatch the follower rejects; the leader decrements next_index[follower] and retries. Once a majority's match_index covers entry N and log[N].term == current_term, the leader advances commit_index to N.

  3. Safety. Election restriction (a candidate only earns a vote if its log is at least as up-to-date as the voter's), the commit-only-current-term rule, and the log-matching property (identical (index, term) ⇒ identical entries) together imply state machine safety: once an entry at index i is applied at one node, no other node will ever apply a different entry at i.

This lab implements the algorithm as it appears in Figure 2 of the paper, minus snapshots and minus membership changes. The simulator drives sim time forward in integer ticks; messages are scheduled into a heap with a deterministic (delivery_time, sender, seq) order; an optional partition set drops messages in one direction between named pairs.


Why does it matter?

  • Raft is the production consensus algorithm of the 2010s. Knowing exactly how prev_log_index works, why commit advance is gated on log[N].term == current_term, and why the election restriction exists is the difference between operating etcd and understanding etcd.

  • Three byte-identical implementations force the spec to be unambiguous. Anywhere Raft "depends on the implementation" (RPC scheduling, election timer jitter, tiebreak for "which leader gets a proposal", iteration order of peer ids) has to be pinned down. The cross-language sha256 makes drift loud.

  • Reproducible partitions. With a deterministic --partition s,d,... flag and a seeded simulator, you can replay the exact sequence of message drops, leadership changes, and committed entries that triggered a bug, on any machine, in any of the three languages.

  • Foundation for the rest of the track. db-18 Paxos and db-19 ZAB will reuse the simulator harness; db-20 distributed-kv will plug a consensus engine into a real key-value store.


How does it work?

State (per node)

persistent  : current_term : u64
              voted_for    : Option<u32>          # None == -1 on the wire
              log          : Vec<LogEntry>        # 1-indexed in Figure 2; 0-indexed here
volatile    : role         : Follower | Candidate | Leader
              commit_index : u64                  # highest log index known committed
              last_applied : u64                  # we apply lazily; rarely diverges from commit_index
leader-only : next_index   : Map<peer_id, u64>    # index of next entry to send to each peer
              match_index  : Map<peer_id, u64>    # highest entry known replicated on each peer
timers      : election_deadline : u64             # sim-time tick
              heartbeat_due     : u64             # next time leader must send AE

Election timer

reset_election_timer(t):
    election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150

A 150-tick base plus up to 149 ticks of seeded jitter makes repeated split votes unlikely. Heartbeats fire every 50 ticks.
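As a rough Rust rendering of the timer rule: the sketch assumes `splitmix64(x)` means "the first splitmix64 output for initial state x", and `election_deadline` is an illustrative free function, not the lab's method signature.

```rust
/// One-shot splitmix64: first output for initial state x (assumption).
fn splitmix64(x: u64) -> u64 {
    let mut z = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E7B5);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Deadline = t + 150 + jitter, with jitter in 0..150 derived from
/// (seed, node_id, t) only, so it is identical on every host.
pub fn election_deadline(seed: u64, node_id: u32, t: u64) -> u64 {
    t + 150 + splitmix64(seed ^ node_id as u64 ^ t) % 150
}
```

The deadline is a pure function of its three inputs, so replaying the same seed reproduces the same election firing times.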

RequestVote handling

on RequestVote(term, candidate, last_log_index, last_log_term):
    if term > current_term:                # newer term seen
        become_follower(term)
    grant = (term == current_term)
         && (voted_for is None or voted_for == candidate)
         && candidate_log_is_at_least_as_up_to_date()
    if grant:
        voted_for = candidate
        reset_election_timer()
    reply RequestVoteReply(current_term, grant)

Up-to-date is defined as: last_log_term > my_last_term, or (last_log_term == my_last_term && last_log_index >= my_last_index).
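The definition translates directly into a predicate. A minimal sketch, with a hypothetical function name and the voter's own last index and term passed in explicitly:

```rust
/// True if the candidate's log is at least as up-to-date as the voter's.
pub fn candidate_is_up_to_date(
    last_log_index: u64,
    last_log_term: u64,
    my_last_index: u64,
    my_last_term: u64,
) -> bool {
    last_log_term > my_last_term
        || (last_log_term == my_last_term && last_log_index >= my_last_index)
}
```

This is the same as comparing the pairs (term, index) lexicographically: a higher last term wins outright, and length only matters when the terms tie.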

AppendEntries handling

on AppendEntries(term, leader, prev_idx, prev_term, entries, leader_commit):
    if term > current_term: become_follower(term)
    if term < current_term: reply (current_term, false); return
    reset_election_timer()
    if prev_idx > 0 && (log too short OR log[prev_idx-1].term != prev_term):
        reply (current_term, false); return        # consistency mismatch
    # truncate any conflicting suffix, then append
    for (i, e) in enumerate(entries):
        idx = prev_idx + i
        if idx < log.len() && log[idx].term != e.term:
            log.truncate(idx)
        if idx >= log.len():
            log.push(e)
    if leader_commit > commit_index:
        commit_index = min(leader_commit, log.len())
    reply (current_term, true, match_index = prev_idx + len(entries))
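The log-manipulation core of that handler can be sketched in Rust as follows. Term checks, timer reset, and the commit_index update are deliberately left out; `append_entries` and its return convention (Some(match_index) on success, None on mismatch) are assumptions made for this illustration.

```rust
#[derive(Clone, Debug, PartialEq)]
pub struct LogEntry {
    pub term: u64,
    pub command: Vec<u8>,
}

/// Consistency check, conflict truncation, and append on a 0-indexed log.
/// prev_idx is the 1-based index of the entry preceding `entries` (0 = none).
pub fn append_entries(
    log: &mut Vec<LogEntry>,
    prev_idx: u64,
    prev_term: u64,
    entries: &[LogEntry],
) -> Option<u64> {
    let prev = prev_idx as usize;
    // Reject if the log is too short or the term at prev_idx differs.
    if prev > 0 && (log.len() < prev || log[prev - 1].term != prev_term) {
        return None;
    }
    for (i, e) in entries.iter().enumerate() {
        let idx = prev + i;
        // A conflicting entry invalidates itself and everything after it.
        if idx < log.len() && log[idx].term != e.term {
            log.truncate(idx);
        }
        if idx >= log.len() {
            log.push(e.clone());
        }
    }
    Some(prev_idx + entries.len() as u64)
}
```

Entries that already match are skipped rather than rewritten, so a duplicated or delayed AppendEntries leaves the log unchanged.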

Commit advance (leader only)

advance_commit():
    for N in (commit_index + 1 ..= log.len()).rev():   # scan from newest to oldest
        if log[N-1].term != current_term: continue   # Figure 8 safety
        replicated = 1 + count(p != self : match_index[p] >= N)   # leader counts itself once
        if 2 * replicated > nodes:
            commit_index = N; break
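A hedged Rust version of the commit rule, written as a pure function so it can be tested in isolation. The signature is an assumption: the log is reduced to its terms, and `peer_match` holds match_index for peers only, with the leader counted separately.

```rust
/// Returns the new commit index (unchanged if nothing can be committed).
/// log_terms[i] is the term of the entry at 1-based index i + 1.
/// peer_match excludes the leader itself.
pub fn advance_commit(
    log_terms: &[u64],
    current_term: u64,
    commit_index: u64,
    peer_match: &[u64],
    nodes: usize,
) -> u64 {
    let last = log_terms.len() as u64;
    // Scan candidate indices from newest to oldest.
    for n in (commit_index + 1..=last).rev() {
        // Only entries from the leader's own term commit by counting replicas.
        if log_terms[(n - 1) as usize] != current_term {
            continue;
        }
        let replicated = 1 + peer_match.iter().filter(|&&m| m >= n).count();
        if 2 * replicated > nodes {
            return n;
        }
    }
    commit_index
}
```

With `nodes == 1` and an empty `peer_match`, the leader alone satisfies the majority test, which is the single-node case the propose path relies on.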

Propose (leader only)

propose(cmd):
    log.push(LogEntry{ term: current_term, command: cmd })
    match_index[self] = log.len()
    broadcast_append_entries()
    advance_commit()           # NB: required for n == 1, harmless for n > 1

The advance_commit() call inside propose is the one non-obvious detail. In a single-node cluster the leader has no peers, so no AppendEntriesReply will ever arrive to trigger a commit — but a majority is already satisfied (the leader alone is the majority). All three implementations call advance_commit() at the end of propose for byte-identical behaviour.

Simulator loop (per tick t in 0..rounds)

1. enqueue scheduled proposals : if t == schedule[i], push payload onto pending
2. inject pending into leader  : pick (max term, min id) among Leaders; call propose
3. deliver in-flight           : pop heap entries with delivery_time == t
4. tick all nodes              : iterate in ascending id; on_tick may fire election or heartbeat
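Step 2's leader pick, (max term, min id), is easy to get subtly different across languages, so a small sketch may help. The tuple representation of a node and the name `pick_leader` are assumptions for illustration only.

```rust
use std::cmp::Reverse;

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Role {
    Follower,
    Candidate,
    Leader,
}

/// Each node is (id, current_term, role). Among Leaders, prefer the highest
/// term; break ties by the lowest id. Returns None if there is no Leader.
pub fn pick_leader(nodes: &[(u32, u64, Role)]) -> Option<u32> {
    nodes
        .iter()
        .filter(|n| n.2 == Role::Leader)
        // Reverse(id) makes "max" prefer the smaller id on equal terms.
        .max_by_key(|n| (n.1, Reverse(n.0)))
        .map(|n| n.0)
}
```

During a partition two nodes can both believe they are Leader; the term comparison routes proposals to the more recent one.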

Proposal schedule: schedule[i] = (i+1) * rounds / (K+1) for i in 0..K (integer division). Deterministic, evenly spread, and independent of cluster behaviour.
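A one-line Rust sketch of that formula, where `k` is the number of proposals (K above) and `proposal_schedule` is an illustrative name:

```rust
/// schedule[i] = (i + 1) * rounds / (k + 1), using integer division.
pub fn proposal_schedule(rounds: u64, k: u64) -> Vec<u64> {
    (0..k).map(|i| (i + 1) * rounds / (k + 1)).collect()
}
```

For rounds = 100 and three proposals this yields ticks 25, 50, and 75; for rounds = 10 the integer division gives 2, 5, and 7.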

Wire format (Rpc)

Four variants; all field widths fixed; little-endian:

RequestVote       { term: u64, candidate: u32, last_log_index: u64, last_log_term: u64 }
RequestVoteReply  { term: u64, granted: bool (u8) }
AppendEntries     { term: u64, leader: u32, prev_idx: u64, prev_term: u64,
                    entries: [LogEntry], leader_commit: u64 }
AppendEntriesReply{ term: u64, success: bool (u8), match_index: u64 }

The wire format is not serialized to disk by this lab — the simulator passes Rpcs as typed structs in memory. Only the canonical dump is serialized, and that is what gets hashed.

Canonical dump format

file := magic[8 = "DSERAFT1"] u32_le(node_count) node*

node := u32_le id
        u64_le current_term
        i64_le voted_for                # -1 if None (two's complement little-endian)
        u8     role                     # Follower=0, Candidate=1, Leader=2
        u64_le commit_index
        u32_le log_len
        entry * log_len

entry := u64_le term
         u32_le cmd_len
         u8 cmd[cmd_len]

Nodes appear in ascending id order. All multi-byte numbers are little-endian. The dump is hashed with SHA-256; the lowercase hex digest is what raftctl prints (no trailing newline).
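
As a sketch of how the grammar above maps to code, the following Rust function serializes a slice of node states. The struct layout and names are assumptions made for illustration; the SHA-256 step is omitted because the standard library has no hash implementation.

```rust
struct LogEntry {
    term: u64,
    command: Vec<u8>,
}

struct NodeState {
    id: u32,
    current_term: u64,
    voted_for: Option<u32>,
    role: u8, // Follower=0, Candidate=1, Leader=2
    commit_index: u64,
    log: Vec<LogEntry>,
}

/// Serializes nodes in the order given; the caller is assumed to pass
/// them sorted by ascending id.
fn canonical_dump(nodes: &[NodeState]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(b"DSERAFT1");
    out.extend_from_slice(&(nodes.len() as u32).to_le_bytes());
    for n in nodes {
        out.extend_from_slice(&n.id.to_le_bytes());
        out.extend_from_slice(&n.current_term.to_le_bytes());
        // None becomes -1, which is eight 0xFF bytes in two's complement.
        let voted: i64 = n.voted_for.map_or(-1, i64::from);
        out.extend_from_slice(&voted.to_le_bytes());
        out.push(n.role);
        out.extend_from_slice(&n.commit_index.to_le_bytes());
        out.extend_from_slice(&(n.log.len() as u32).to_le_bytes());
        for e in &n.log {
            out.extend_from_slice(&e.term.to_le_bytes());
            out.extend_from_slice(&(e.command.len() as u32).to_le_bytes());
            out.extend_from_slice(&e.command);
        }
    }
    out
}

fn main() {
    let node = NodeState {
        id: 0,
        current_term: 1,
        voted_for: None,
        role: 0,
        commit_index: 0,
        log: vec![LogEntry { term: 1, command: b"p00".to_vec() }],
    };
    let dump = canonical_dump(&[node]);
    println!("{} bytes, voted_for = {:02x?}", dump.len(), &dump[24..32]);
}
```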

Cross-language invariants

Invariant | Why it matters
splitmix64 constants 0x9E3779B97F4A7C15, 0xBF58476D1CE4E7B5, 0x94D049BB133111EB | identical PRNG output
election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150 | identical election firing times
delivery_delay = 1 + splitmix64(seed ^ src ^ dst ^ t) % 3 | identical message scheduling
heap order (delivery_time, sender, seq); seq global monotonic | identical delivery sequence
peers iterated in ascending id (BTreeMap / std::map / explicit for p:=0;p<n;p++) | identical broadcast order
leader-pick for proposal injection: (max term, min id) | identical client routing
proposal schedule: (i+1) * rounds / (K+1) integer division | identical pending queue contents
propose() calls advance_commit() | identical commit_index for n=1
voted_for = None encoded as i64 LE -1 | identical dump bytes
Role enum order Follower=0, Candidate=1, Leader=2 | identical dump bytes

If any one of these drifts, scripts/cross_test.sh will fail, and cmp -l on the two raw dumps will list every differing byte; its first output line gives the (1-based, decimal) offset of the first divergence.


Files

  • src/rust/raft17 crate + raftctl binary.
  • src/go/ — module github.com/10xdev/dse/db17 + cmd/raftctl.
  • src/cpp/db17_lib static library + raftctl binary + test_db17.
  • scripts/verify.sh — runs the unit tests for all three.
  • scripts/cross_test.sh — proves the three binaries produce byte-identical canonical dumps for six seeded scenarios.

See docs/ for the long-form write-up and steps/ for the staged implementation path.

db-17 — References

Primary sources

  • Diego Ongaro and John Ousterhout, In Search of an Understandable Consensus Algorithm (Extended Version), USENIX ATC 2014. The Raft paper. Figure 2 is the spec this lab implements; Figure 8 is the motivation for the "commit only entries of the current term" rule. https://raft.github.io/raft.pdf
  • Diego Ongaro, Consensus: Bridging Theory and Practice, Stanford PhD dissertation, 2014. The book-length treatment. Chapter 3 covers what's in this lab; chapters 4 and 5 cover membership changes and log compaction with snapshots (deferred to db-23 and db-21 respectively). https://github.com/ongardie/dissertation
  • raft-tla — the TLA+ specification of the algorithm, also by Ongaro. Useful when you want a second, machine-checked statement of the same rules implemented here. https://github.com/ongardie/raft.tla

Implementations to read alongside

  • etcd/raft (Go) — the most-studied production Raft. Same Figure 2 structure; adds pre-vote, leader leases, learner replicas, ReadIndex, joint consensus. https://github.com/etcd-io/raft
  • hashicorp/raft (Go) — Consul's engine. Easier to read than etcd's because it carries less production scar tissue. https://github.com/hashicorp/raft
  • tikv/raft-rs (Rust) — TiKV's port of etcd's algorithm. Useful as a counterpoint to this lab's stdlib-only Rust version. https://github.com/tikv/raft-rs

Determinism and simulation

  • db-16's references on FoundationDB simulation testing and TigerBeetle apply verbatim here.
  • Antithesis (commercial) and Hermit (Meta, open source) are deterministic execution and simulation tools; the spirit is the same as cross_test.sh.

Background reading worth doing

  • Heidi Howard et al., Flexible Paxos: Quorum intersection revisited, OPODIS 2016. Helps see Raft as a specialization of Paxos with a fixed quorum intersection rule.
  • Lamport's Paxos Made Simple — for the db-18 transition.
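  • André Medeiros, ZooKeeper's Atomic Broadcast Protocol: Theory and Practice (2012), for the db-19 transition. The original protocol paper is Junqueira, Reed, and Serafini, Zab: High-performance broadcast for primary-backup systems, DSN 2011.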

Cross-lab dependencies

  • Upstream: db-16 distributed-fundamentals (Lamport and vector clocks, plus the deterministic simulator harness whose discipline this lab inherits wholesale).
  • Downstream:
    • db-18 Paxos — reuses the heap-and-tick simulator; different RPC structure; weaker leader assumption.
    • db-19 ZAB — leader-based atomic broadcast; same election + log-replication skeleton.
    • db-20 Distributed KV — wraps a chosen consensus engine (probably this one) around a key-value state machine.
    • db-23 Capstone — joint membership changes and snapshots get added on top of this code.

db-17 — Analysis

Required invariants

  1. Election safety. At most one leader per term. Enforced by majority voting: a candidate only becomes leader after collecting votes from a strict majority, and each voter only grants one vote per term (the voted_for field, persisted in the canonical dump).

  2. Leader append-only. A leader never overwrites or deletes entries from its own log; it only appends. Followers may truncate on an AppendEntries consistency mismatch, but the leader's local log only grows.

  3. Log matching property. If two logs contain an entry with the same (index, term), then the logs are identical in all entries up through that index. Enforced by the prev_idx / prev_term check in AppendEntries (prevLogIndex / prevLogTerm in the paper) and the truncate-on-conflict rule.

  4. Leader completeness. If an entry is committed in term T, that entry is present in the log of every leader for all later terms. Enforced by the election restriction (a vote is only granted if the candidate's log is at least as up-to-date as the voter's).

  5. State machine safety. If a node has applied an entry at index i, no other node will ever apply a different entry at i. This follows from log matching + leader completeness + the commit-only-current-term rule.

  6. Byte determinism. For every (seed, nodes, rounds, proposals, partition) tuple, the three binaries produce identical canonical_dump bytes — hence identical sha256 hex on stdout. scripts/cross_test.sh checks six scenarios.

Design decisions

  • propose() calls advance_commit() at the end. The non-obvious one. In a single-node cluster the "leader" has no peers, so no AppendEntriesReply will ever arrive to drive advance_commit(). But a single-node cluster is its own majority, so the entry should commit the moment it is appended. Without this call, scenario D (--nodes 1) ends with commit_index = 0 despite five proposals in the log. Calling advance_commit() is harmless for n > 1 (the loop's majority check rejects until replies actually arrive).

  • Sorted iteration on every wire-affecting loop. Rust uses BTreeMap<u32, u64> for next_index / match_index; C++ uses std::map; Go uses explicit for p := uint32(0); p < n; p++ loops. HashMap would compile and pass single-language tests but fail cross_test.sh immediately. db-16's analysis.md called this out; db-17 enforces it across more code surface.

  • In-flight heap ordered by (delivery_time, sender, seq). seq is a global monotonic counter incremented every time a message is enqueued. It exists only to break ties when two messages with the same (delivery_time, sender) would otherwise be ambiguously ordered. Without seq you would see byte diffs on dense traffic at the same delivery tick.

  • Leader-pick for proposal injection is (max term, min id) among role == Leader nodes. During leadership churn there may be no leader, or there may be multiple stale leaders that have not yet stepped down. The (max term, min id) rule produces a deterministic routing decision no matter which language's iteration order you start from.

  • Proposal schedule is closed-form. schedule[i] = (i+1) * rounds / (K+1) (integer division). This places K proposals evenly through the rounds window, independent of cluster behaviour. A schedule derived from cluster state ("propose whenever there's a leader") would couple proposal timing to incidental scheduling choices and produce noisy byte diffs.

  • Splitmix64 constants are explicit. 0x9E3779B97F4A7C15 (γ / golden-ratio fractional, the seeder constant), 0xBF58476D1CE4E7B5 and 0x94D049BB133111EB (Vigna's two mixer constants). All three implementations copy them as literals; nobody computes them.

  • Library + thin CLI. The lab exposes Cluster::new, run, canonical_dump, and sha256 as a library. The CLI is a few dozen lines of arg parsing plus four function calls. Downstream labs (db-18 Paxos, db-20 distributed-kv) will link the library, not shell out.
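
To make the in-flight heap ordering concrete, here is a small Rust sketch. It assumes messages are keyed by a plain `(delivery_time, sender, seq)` tuple and uses `std::collections::BinaryHeap` with `Reverse` to get min-heap behaviour; the lab's actual message type is not shown in this document.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Pops keys in ascending (delivery_time, sender, seq) order.
fn delivery_order(msgs: &[(u64, u32, u64)]) -> Vec<(u64, u32, u64)> {
    // BinaryHeap is a max-heap, so wrap each key in Reverse.
    let mut heap: BinaryHeap<Reverse<(u64, u32, u64)>> =
        msgs.iter().copied().map(Reverse).collect();
    let mut out = Vec::with_capacity(msgs.len());
    while let Some(Reverse(key)) = heap.pop() {
        out.push(key);
    }
    out
}

fn main() {
    // Two messages share (delivery_time = 5, sender = 1); seq decides.
    let pending = [(5, 1, 2), (3, 2, 0), (5, 1, 1), (5, 0, 3)];
    for (t, sender, seq) in delivery_order(&pending) {
        println!("t={} sender={} seq={}", t, sender, seq);
    }
}
```

Tuples compare lexicographically, so the heap key needs no custom `Ord` implementation. Since `seq` is unique per message, no two keys ever compare equal and the pop order is fully determined.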

Tradeoffs worth flagging

  • No snapshots, no log compaction. Logs grow without bound across the run. For --rounds 2000 --proposals 20 you end up with ~20 entries per node; the canonical dump stays small. For production Raft you would add the InstallSnapshot RPC from the Raft paper and a last_included_index / last_included_term; deferred to db-21 (storage-engine-advanced).

  • No pre-vote, no leader lease. A network-partitioned candidate will repeatedly increment its term, then on heal will force the legitimate leader to step down. Mitigated by tight election timeouts in this simulator but a real cluster needs the pre-vote optimization (Ongaro thesis §9.6).

  • No membership changes. The node count is fixed at Cluster::new time. Joint consensus (and the safer learner-then-promote alternative) is a major chapter on its own; deferred to db-23 capstone.

  • Crash semantics are stylized. Crashes are simulated only via the partition flag (drop all messages in one direction). A real Raft must handle persistent storage corruption, fsync ordering, and restart-mid-vote; the canonical dump pretends all state is durable by construction.

  • No client-side dedup. A proposal injected into a leader that immediately loses leadership may be replicated or may be lost; if it is lost, it is never re-proposed. The simulator's pending queue is drained unconditionally; we are testing the consensus core, not the client RPC layer.

Why three languages

Same reasoning as db-16, plus one new lesson: Raft has many places where "iterate over peers" appears. Each one is a chance for a byte diff. C++'s std::map and Rust's BTreeMap are ordered by default; Go's map is explicitly randomized at iteration time. The Go implementation has explicit for p := uint32(0); p < n; p++ loops everywhere a peer iteration appears. Discovering this discipline by forcing the cross-language test to pass is more durable than reading "don't use HashMap" in a style guide.

db-17 — Execution

One-shot: prove the lab works

cd db-17-raft
./scripts/verify.sh        # all unit tests in Rust, Go, C++
./scripts/cross_test.sh    # byte-identical sha256 across all three, six scenarios

A green run of cross_test.sh ends with the literal line:

=== ALL OK ===

Per-language workflows

Rust

cd src/rust
cargo test --release       # ~10 tests
cargo build --release      # produces target/release/raftctl
./target/release/raftctl --seed 42 --nodes 3 --rounds 1000 --proposals 5

Go

cd src/go
go test ./...              # ~12 tests
go build -o /tmp/raftctl_go ./cmd/raftctl
/tmp/raftctl_go --seed 42 --nodes 3 --rounds 1000 --proposals 5

C++

cd src/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
ctest --test-dir build   # test_db17 → "db-17 C++ tests: 10 passed"
./build/raftctl --seed 42 --nodes 3 --rounds 1000 --proposals 5

CLI

All three binaries accept the same flags and print lowercase hex sha256 of the canonical dump to stdout with no trailing newline:

flag | default | meaning
--seed N | 0 | splitmix64 seed mixed into election timers and message delays
--nodes K | 3 | number of Raft nodes (1 is legal; majority is then 1)
--rounds R | 1000 | number of simulator ticks to run
--proposals P | 0 | number of client commands to inject during the run
--partition s,d,... | none | comma-separated pairs (src, dst) to drop in that direction

--partition 0,1,1,0 drops both directions between nodes 0 and 1 (complete split); --partition 0,1 drops only 0 → 1 (asymmetric). Proposals are spaced as schedule[i] = (i+1) * rounds / (K+1); with --rounds 1000 --proposals 5 they fire at ticks 166, 333, 500, 666, 833.
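
One way to parse the flag value into directed pairs is sketched below. The function names and error messages are hypothetical; only the `src,dst,src,dst,...` format comes from the text above.

```rust
use std::collections::BTreeSet;

/// Parses "s,d,s,d,..." into a set of directed (src, dst) pairs to drop.
fn parse_partition(spec: &str) -> Result<BTreeSet<(u32, u32)>, String> {
    let ids = spec
        .split(',')
        .map(|s| {
            s.trim()
                .parse::<u32>()
                .map_err(|e| format!("bad node id {:?}: {}", s, e))
        })
        .collect::<Result<Vec<u32>, String>>()?;
    if ids.len() % 2 != 0 {
        return Err(format!("expected src,dst pairs, got {} ids", ids.len()));
    }
    Ok(ids.chunks(2).map(|pair| (pair[0], pair[1])).collect())
}

/// True if a message from src to dst should be dropped.
fn is_dropped(blocked: &BTreeSet<(u32, u32)>, src: u32, dst: u32) -> bool {
    blocked.contains(&(src, dst))
}

fn main() {
    let asymmetric = parse_partition("0,1").unwrap();
    println!("0 -> 1 dropped: {}", is_dropped(&asymmetric, 0, 1)); // true
    println!("1 -> 0 dropped: {}", is_dropped(&asymmetric, 1, 0)); // false
}
```

A `BTreeSet` keeps the pairs in sorted order, in line with the sorted-iteration discipline the lab applies elsewhere.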

Canonical scenarios

scripts/cross_test.sh runs six scenarios; their sha256s are listed in docs/observation.md. If any change, cross_test.sh will diff the raw dumps and exit non-zero.

label | args
A | --seed 42 --nodes 3 --rounds 1000 --proposals 5
B | --seed 7 --nodes 5 --rounds 2000 --proposals 20
C | --seed 99 --nodes 3 --rounds 500 --proposals 0
D | --seed 1 --nodes 1 --rounds 200 --proposals 5
E | --seed 42 --nodes 3 --rounds 1000 --proposals 3 --partition 0,1,0,2,1,0,2,0
F | --seed 3 --nodes 5 --rounds 1500 --proposals 10 --partition 0,1

D exercises the single-node-leader code path that motivated the propose() → advance_commit() call. E isolates node 0 completely; the other two must elect a leader and commit the remaining proposals. F is an asymmetric partition that causes term churn but recoverable replication.

Sanity checks

# magic bytes of the canonical dump (use the lib directly; the CLI hashes it)
cat <<'EOF' | cargo run --quiet --example dump_magic
EOF
# or just trust the test: TestCanonicalDumpMagic in raft_test.go
# or for C++:   test_db17 prints "canonical dump magic OK" among its asserts

# pick any scenario and round-trip:
./src/rust/target/release/raftctl --seed 42 --nodes 3 --rounds 1000 --proposals 5
# expect:  a2299ff06a2ed5ced5842d100bb7867b3ae50f6e7d7da93f835385565f1ed9e9

db-17 — Observation

What the cross-language test produces and how to read it by hand.

Expected sha256s

scripts/cross_test.sh runs six scenarios and asserts the three binaries (Rust, Go, C++) all print the same hex digest. The current canonical hashes are:

label | args | sha256
A | --seed 42 --nodes 3 --rounds 1000 --proposals 5 | a2299ff06a2ed5ced5842d100bb7867b3ae50f6e7d7da93f835385565f1ed9e9
B | --seed 7 --nodes 5 --rounds 2000 --proposals 20 | b6dc06aee72e595f51bd5045ea7c92ffcbe7f6fda3198985f9ded1eca2671c4b
C | --seed 99 --nodes 3 --rounds 500 --proposals 0 | f9db9ea7e6c1ca2b3a911b42b2431e964a4ee7c5e40e27efd29b41e747958838
D | --seed 1 --nodes 1 --rounds 200 --proposals 5 | ce8b8e05d6ad0b4a243753a934b2f052c2363e97beca0c175586677d1a489408
E | --seed 42 --nodes 3 --rounds 1000 --proposals 3 --partition 0,1,0,2,1,0,2,0 | b1689eb48b209187b7cd82a24b1a6a2d19b0be4b481ac1a5b4f1ac9e23a6ae05
F | --seed 3 --nodes 5 --rounds 1500 --proposals 10 --partition 0,1 | fcc70ecabe37509133bb27155f5bd7d74981c3f98e79719e2b47077acca6a31f

If any of these change, cross_test.sh will fail; either you have a bug, or you have intentionally changed the spec (timer constants, schedule formula, dump layout) and you must update this table in the same commit.

What the canonical dump looks like (scenario D — single node)

--seed 1 --nodes 1 --rounds 200 --proposals 5. Five proposals into a single-node cluster — leader is itself the majority, so every proposal commits immediately.

offset 0x00 :  44 53 45 52 41 46 54 31    "DSERAFT1"        magic
offset 0x08 :  01 00 00 00                 1                 node_count
offset 0x0c :  00 00 00 00                 0                 node id
offset 0x10 :  ?? ?? ?? ?? ?? ?? ?? ??     current_term      (~1, the first self-election)
offset 0x18 :  00 00 00 00 00 00 00 00     voted_for = 0     (voted for self in term 1)
offset 0x20 :  02                          role = Leader (2)
offset 0x21 :  05 00 00 00 00 00 00 00     commit_index = 5
offset 0x29 :  05 00 00 00                 log_len = 5
offset 0x2d :  XX XX XX XX XX XX XX XX     log[0].term       (== current_term)
offset 0x35 :  03 00 00 00                 log[0].cmd_len    (3 bytes: "p00")
offset 0x39 :  70 30 30                    "p00"             payload
...

Each entry is 8 + 4 + 3 = 15 bytes (term + cmd_len + "pNN"). With five 3-byte payloads, the total dump for D is therefore 0x2d + 5 * 15 = 0x78 bytes = 120 bytes. The size is fixed by the layout; only the term bytes vary, depending on how many election cycles --seed 1 produces before the first self-vote.

A multi-node dump (scenario C — quiet cluster)

--seed 99 --nodes 3 --rounds 500 --proposals 0. No proposals; the cluster elects a leader, sends heartbeats, and that is it. Every node's log is empty:

44 53 45 52 41 46 54 31         magic
03 00 00 00                     node_count = 3

00 00 00 00                     node id 0
XX XX XX XX XX XX XX XX         current_term       (1 if 0 elected itself, otherwise higher)
XX XX XX XX XX XX XX XX         voted_for           (0 for the leader, otherwise the leader id)
XX                              role                (Leader or Follower; never Candidate at quiescence)
00 00 00 00 00 00 00 00         commit_index = 0
00 00 00 00                     log_len = 0

01 00 00 00                     node id 1
... same shape ...

02 00 00 00                     node id 2
... same shape ...

Total dump: 8 + 4 + 3 * (4 + 8 + 8 + 1 + 8 + 4) = 111 bytes.
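
The size arithmetic generalizes to any cluster. A small Rust sketch, with illustrative constant and function names, taking one list of command lengths per node:

```rust
const HEADER: usize = 8 + 4; // magic + node_count
const NODE_FIXED: usize = 4 + 8 + 8 + 1 + 8 + 4; // id, term, voted_for, role, commit_index, log_len
const ENTRY_FIXED: usize = 8 + 4; // term + cmd_len

/// Each inner Vec holds the command length of every log entry on one node.
fn dump_size(cmd_lens_per_node: &[Vec<usize>]) -> usize {
    let nodes: usize = cmd_lens_per_node
        .iter()
        .map(|lens| NODE_FIXED + lens.iter().map(|len| ENTRY_FIXED + len).sum::<usize>())
        .sum();
    HEADER + nodes
}

fn main() {
    // Scenario C: three nodes, all logs empty.
    let quiet: Vec<Vec<usize>> = vec![Vec::new(); 3];
    println!("{}", dump_size(&quiet)); // prints 111
}
```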

How to debug a divergence

If cross_test.sh fails, the script captures the raw dump from each language into /tmp/raft_<label>_<lang>.bin and prints which two languages diverged. Then:

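cmp -l /tmp/raft_A_rust.bin /tmp/raft_A_go.bin | head    # column 1 = 1-based decimal offset
# xxd prints 16 bytes per line, so <line> = (offset - 1) / 16 + 1
xxd /tmp/raft_A_rust.bin | sed -n '<line>,+2p'
xxd /tmp/raft_A_go.bin   | sed -n '<line>,+2p'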

The first divergence offset tells you what to look at:

offset range | likely culprit
0x00–0x07 | magic (typo: DSERAFT1 not DESRAFT1)
0x08–0x0b | node_count (impossible if all three accept --nodes correctly)
inside a node block, on current_term | election timer or heap-order bug
inside a node block, on voted_for | None encoding (must be i64 LE -1)
inside a node block, on role | enum mapping (Follower=0, Candidate=1, Leader=2)
inside a node block, on commit_index | propose() not calling advance_commit(), or quorum count wrong
inside a log entry | AppendEntries truncate-on-conflict bug, or peer iteration order

In all six existing scenarios these checks pass; the table above is the runbook for the day someone changes the algorithm and forgets to update one of the three implementations.
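
If you prefer to locate the divergence from inside a test rather than with cmp, a helper along these lines is enough. It is a sketch with a hypothetical name, and it reports a 0-based offset, whereas cmp -l counts from 1.

```rust
/// Returns the 0-based offset of the first differing byte, or the length
/// of the shorter input if one dump is a strict prefix of the other.
fn first_divergence(a: &[u8], b: &[u8]) -> Option<usize> {
    match a.iter().zip(b).position(|(x, y)| x != y) {
        Some(i) => Some(i),
        None if a.len() != b.len() => Some(a.len().min(b.len())),
        None => None,
    }
}

fn main() {
    // Same magic, different node_count.
    let rust_dump = b"DSERAFT1\x03\x00\x00\x00";
    let go_dump = b"DSERAFT1\x05\x00\x00\x00";
    if let Some(off) = first_divergence(rust_dump, go_dump) {
        // xxd shows 16 bytes per line, so the line number is off / 16 + 1.
        println!("first divergence at 0x{:02x} (xxd line {})", off, off / 16 + 1);
    }
}
```

Offset 0x08 in this example falls in the node_count row of the table above.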

Tick-level scope (Rust trace trick)

To watch a scenario from the inside, add this temporary print in Cluster::run at the top of the simulator loop body (it reads the tick counter t):

if std::env::var("RAFT_TRACE").is_ok() {
    eprintln!("t={} leader={:?} terms={:?}", t,
        self.nodes.iter().find(|n| n.role == Role::Leader).map(|n| n.id),
        self.nodes.iter().map(|n| n.current_term).collect::<Vec<_>>());
}

then run RAFT_TRACE=1 raftctl --seed 42 --nodes 3 --rounds 1000 ... 2>&1 | head -50. The trace is written to stderr (hence the 2>&1); it is not part of the canonical dump and does not affect the sha256. Remove before commit.

db-17 — Verification

How to reproduce the green status on a clean machine.

Prerequisites

  • macOS or Linux with Apple Clang / clang ≥ 14 / gcc ≥ 11.
  • cmake ≥ 3.20.
  • Rust toolchain ≥ 1.74.
  • Go ≥ 1.22.
  • shasum, cmp, awk (default on macOS; coreutils on Linux).

One command

cd db-17-raft
scripts/verify.sh        # builds + unit tests in all three langs
scripts/cross_test.sh    # cross-language sha256 match across six scenarios

Both should exit 0; verify.sh prints === OK === and cross_test.sh prints === ALL OK ===.

Per-language drill-down

Rust

cd db-17-raft/src/rust
cargo test --release --quiet
cargo build --release

Expected: ~10 tests pass. The raftctl binary lands at target/release/raftctl. The release profile uses LTO.

Go

cd db-17-raft/src/go
go test ./...
go build -o /tmp/raftctl_go ./cmd/raftctl

Expected: ok github.com/10xdev/dse/db17 <duration> and a working binary. Tests include TestSha256HexKnownVectors (validates the SHA-256 wrapper against published vectors) and TestVotedForNegativeEncoding (validates the -1 sentinel byte layout).

C++

cd db-17-raft/src/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
ctest --test-dir build --output-on-failure

Expected: 100% tests passed, 0 tests failed out of 1 and test_db17 prints "db-17 C++ tests: 10 passed". The test source undefines NDEBUG (#undef NDEBUG) before including <cassert>, so assert() fires in Release builds too.

What "green" means

A green run guarantees:

  1. Per-language unit tests pass. Each implementation independently exercises splitmix64, election-timer reset, RequestVote granting, AppendEntries truncation, single-node commit, multi-node commit, canonical dump magic + node_count + log_len framing, and the SHA-256 implementation against a known test vector.

  2. All six scenarios produce byte-identical canonical dumps across Rust, Go, and C++. cross_test.sh compares the raw dump bytes (cmp -s) before comparing the sha256 hex, so a divergence can be traced to an exact byte offset (cmp -l on the captured dumps) rather than just "the hashes don't match".

  3. The sha256s match the table in docs/observation.md. If you change the algorithm or the dump format, both the dumps and the table must change in the same commit. A mismatch between code and docs is itself a verification failure.

What "green" does NOT guarantee

  • No production safety. There is no fsync; in-memory state is considered durable by construction.
  • No coverage of snapshot / membership / pre-vote / lease code. Snapshots are deferred to db-21 and membership changes to db-23; pre-vote and leases may never be added (this lab is a study lab, not a production engine).
  • No client-facing API. Proposals are injected into the simulator via a fixed schedule; there is no Propose RPC for an external client.
  • No performance characterization. The lab is sized to run in under a second per scenario; multi-thousand-round runs work but are not the goal.

Invariant assertions in code

The C++ test file in particular makes the invariants concrete:

assertion | invariant
dump.size() >= 12 and starts with "DSERAFT1" | dump-format magic
read_u32_le(dump, 8) == nodes | node_count framing
cluster.run(...) does not panic for any tested (seed, nodes, rounds, P) | no out-of-bounds / no UB
sha256(empty) == e3b0c44298... | SHA-256 padding boundary case
n.commit_index <= n.log.len() for every node after run | no over-commit
propose on a single-node leader yields commit_index == proposals | majority-of-one rule

The Rust and Go tests assert the same set in their respective testing idioms.

db-17 — Broader Ideas

Where Raft and the choices in this lab show up in real systems, and what to build on top of the same simulator harness in the rest of the distributed track.

Immediate next labs

  • db-18 — Paxos. Same heap-and-tick harness, different RPC structure (Prepare / Promise / Accept / Accepted), no fixed leader. Paxos's invariants are notoriously hard to reason about by hand; byte-deterministic replay of a counterexample seed is the difference between "I think it's correct" and "I have evidence". Raft was literally designed as the more understandable alternative — implementing both in this order is the recommended path.

  • db-19 — ZAB. ZooKeeper's atomic broadcast protocol. Similar leader-based skeleton to Raft, but the recovery phase is more involved (NEWLEADER / NEWEPOCH / SYNC / BROADCAST). The Lamport scalar of db-16 generalizes to the (epoch, counter) pair that ZAB calls a "zxid".

  • db-20 — Distributed KV. Wrap a quorum-replicated key-value store around a chosen consensus engine. The state machine is the only thing that changes — the consensus log feeds Put(k, v) / Delete(k) commands that get applied in commit_index order.

  • db-23 — Capstone. Adds snapshots, joint-consensus membership changes, and a multi-Raft "shards across regions" deployment on top of this code.

How this lab's pieces map to real systems

  • The Raft skeleton implemented here is the same core that etcd, Consul, TiKV, CockroachDB, and Vault's integrated storage run; MongoDB's replica-set protocol is Raft-derived rather than a direct implementation. They each add the extensions deferred from this lab (pre-vote, snapshots, learners, joint consensus), but the core RequestVote / AppendEntries loop is recognizably the same.

  • The (delivery_time, sender, seq) heap tie-break is the same trick FoundationDB's simulator uses to drive every commit-proxy / storage-server interaction; TigerBeetle, Antithesis, and Hermit all reach for it independently.

  • The "leader picks max-term, min-id" convention surfaces as the split-brain resolution rule in production systems: when a partition heals and you see two leaders, the one with the higher term wins unconditionally (id break is academic — different terms imply different elections).

  • The voted_for = None encoded as -1 is one common convention for fixed-width on-disk formats, though not a universal one: etcd/raft, for example, reserves node id 0 to mean "no vote". Some implementations encode it as an optional / nullable type in a richer wire format (protobuf has optional), but in any fixed-width binary log a sentinel value is the right answer.

Performance experiments worth running later

  • Crank --rounds to 100k, scale --proposals with it, and watch the canonical dump grow. The dump should be linear in committed entries; if you ever see super-linear growth something is appending entries that don't get committed (a sign of partition oscillation).

  • Replace splitmix64 with a per-node ChaCha20-based RNG (in Rust, for example, rand_chacha::ChaCha20Rng). The simulator will still be deterministic (RNGs are seeded), but cross-language byte equivalence will break unless you also port the ChaCha core identically. Useful exercise in what exactly portability requires.

  • Try injecting one heavy proposal vs. many small proposals into a 3-node cluster and measure the cluster dump size vs. the bytes actually committed. The difference is the steady-state replication overhead.

  • Vary the election timeout. The 150 + jitter(0..150) tick timeout chosen here keeps churn low; halve it and you'll see term numbers climb rapidly under any partition, especially scenario F.

What "production-quality" would require beyond this lab

  • A real disk-backed persistence layer with fsync semantics and crash recovery. The canonical dump pretends current_term, voted_for, and log are durable on every change; a real Raft must fsync before sending any reply that depends on the new state, or risk violating election safety on a power cut.

  • Network I/O. The simulator hands typed structs across an in-process heap; production uses gRPC or a custom framing protocol with at-least-once delivery and connection-level back-pressure.

  • Pre-vote and leader leases. Without them, a partitioned candidate bumps its term repeatedly; on heal the legitimate leader steps down unnecessarily. Easy to add as a wrapper on RequestVote; deferred here because it would obscure the core algorithm.

  • Snapshots and log compaction. Without them, the log grows forever and a slow follower can't catch up over the wire. The canonical dump tolerates this only because the lab's rounds is bounded.

  • Membership changes. The fixed nodes count at Cluster::new time is fine for a lab but useless in production. Joint consensus or the safer learner-then-promote protocol are major additions; covered in db-23.

  • Observability. A real Raft cluster exposes per-node term, commit_index, match_index[*], leader_id, election counts, and message rates as metrics. The canonical dump is a post-mortem view; runtime observability is a separate problem.

db-17 step 01 — Leader election

Goal

A cluster of N followers (N = --nodes), started cold, must elect exactly one leader in a bounded number of ticks, and that leader must remain stable as long as it can deliver heartbeats. The election protocol must be byte-deterministic across Rust, Go, and C++.

Tasks

  1. Persistent state. Each RaftNode carries current_term: u64, voted_for: Option<u32>, and log: Vec<LogEntry>. The dump encodes voted_for=None as the signed integer -1 (i64 LE); Some(id) becomes id as i64.

  2. Election timer. reset_election_timer(t) sets election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150. Heartbeat-due is t + 50.

  3. on_tick(t). Followers and candidates that hit election_deadline start a new election: bump current_term, vote for self, broadcast RequestVote to all peers, transition to Candidate. Leaders that hit heartbeat_due broadcast an empty AppendEntries (heartbeat).

  4. RequestVote handling. Grant a vote iff (a) term == current_term, (b) voted_for is None or equal to the candidate, and (c) the candidate's log is at least as up-to-date as ours (the standard last_log_term/last_log_index lex compare). Grant resets the election timer.

  5. RequestVoteReply handling. A candidate that collects a majority of granted replies in the same term transitions to Leader, initializes next_index[p] = log.len() and match_index[p] = 0 for every peer p, and immediately broadcasts AppendEntries (initial heartbeat).

  6. become_follower(term). Used whenever a node sees term > current_term (in any RPC). Sets current_term = term, clears voted_for, resets the election timer, transitions to Follower.
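
The grant rule in task 4 fits in a few lines. The Rust sketch below assumes any higher-term step-down (task 6) has already run, and that an empty log reports last_log_term = 0 and last_log_index = 0; the struct and function names are illustrative.

```rust
struct VoterState {
    current_term: u64,
    voted_for: Option<u32>,
    last_log_term: u64,
    last_log_index: u64,
}

struct RequestVote {
    term: u64,
    candidate: u32,
    last_log_index: u64,
    last_log_term: u64,
}

/// Conditions (a), (b), and (c) from task 4.
fn should_grant(me: &VoterState, rv: &RequestVote) -> bool {
    let term_ok = rv.term == me.current_term;
    let vote_free = me.voted_for.map_or(true, |v| v == rv.candidate);
    // Tuples compare lexicographically: term first, then index.
    let log_ok =
        (rv.last_log_term, rv.last_log_index) >= (me.last_log_term, me.last_log_index);
    term_ok && vote_free && log_ok
}

fn main() {
    let mut me = VoterState { current_term: 4, voted_for: None, last_log_term: 3, last_log_index: 7 };
    let from_a = RequestVote { term: 4, candidate: 1, last_log_index: 7, last_log_term: 3 };
    let from_b = RequestVote { term: 4, candidate: 2, last_log_index: 9, last_log_term: 3 };

    println!("A granted: {}", should_grant(&me, &from_a)); // true
    me.voted_for = Some(from_a.candidate);
    println!("B granted: {}", should_grant(&me, &from_b)); // false, already voted for A
}
```

The caller is responsible for recording `voted_for` and resetting the election timer after a grant, as the task describes.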

Acceptance

Inline unit tests in each language:

  • splitmix64_known_vectorssplitmix64(0) == 0xE220A8397B1DCDAF (the value Vigna's reference C produces).
  • election_timer_in_range — 1000 consecutive resets all land in [t+150, t+300).
  • request_vote_grants_first_only — vote for candidate A, then a RequestVote from B in the same term is denied.
  • become_leader_from_majority — 3-node cluster, two RequestVoteReply with granted=true transitions the candidate to Leader.
  • term_bump_demotes_leader — a Leader receiving any RPC with term > current_term becomes Follower and clears voted_for.

All five green in Rust, Go, and C++.

Discussion prompts

  • Why is voted_for persistent (in the canonical dump) but commit_index volatile (also dumped, but only because the dump is a debug oracle, not a recovery file)?
  • What goes wrong if you reset the election timer on send of RequestVote instead of on grant of someone else's vote? (Hint: split-vote loops.)
  • Why must "majority" be computed against nodes, not against nodes that have replied?

db-17 step 02 — Log replication

Goal

A leader accepts client proposals, replicates them to followers via AppendEntries, and advances commit_index once a majority's match_index covers the entry and the entry is from the leader's current term. Followers truncate any conflicting suffix and append the leader's entries. The result must be byte-deterministic across all three languages.

Tasks

  1. LogEntry. { term: u64, command: Vec<u8> }. Logs are 0-indexed in this implementation, whereas Figure 2 of the Raft paper is 1-indexed; adjust mentally when reading the paper. commit_index and match_index are counts of entries, so entry N lives at log[N-1].

  2. propose(cmd). Leader-only:

    • push LogEntry { term: current_term, command: cmd } onto own log,
    • set match_index[self] = log.len(),
    • broadcast AppendEntries to all peers,
    • call advance_commit() (so n=1 commits immediately).
  3. broadcast_append_entries(). For each peer in ascending id order, send AppendEntries { term, leader, prev_idx, prev_term, entries: log[next_index[p]..], leader_commit }. prev_idx = next_index[p], prev_term = log[prev_idx-1].term (or 0 if prev_idx == 0).

  4. AppendEntries handling on follower.

    • if term > current_term: become_follower(term);
    • if term < current_term: reply success=false;
    • reset election timer (we heard from a leader);
    • if prev_idx > 0 && (log too short || log[prev_idx-1].term != prev_term): reply success=false, match_index=0;
    • else: walk each incoming entry; truncate own log at the first (index, term) conflict; append remaining entries; advance commit_index = min(leader_commit, log.len()); reply success=true, match_index=prev_idx+entries.len().
  5. AppendEntriesReply handling on leader.

    • if term > current_term: become_follower(term);
    • if success: next_index[from] = reply.match_index (match_index is a count, so it is also the 0-based position of the next entry to send); match_index[from] = reply.match_index; advance_commit();
    • if !success and term == current_term: decrement next_index[from] (clamped at 0); next heartbeat / propose will retry with an earlier prev_idx.
  6. advance_commit(). For N from log.len() down to commit_index + 1:

    • if log[N-1].term != current_term: continue (Figure 8 safety);
    • if 1 + count(p : match_index[p] >= N) > nodes / 2: set commit_index = N and break.
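
The commit rule in task 6 can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the lab's reference code: the function signature and the `peer_match` slice (one match_index per peer, with the leader counted separately as the `1 +` term) are hypothetical names chosen for the example.

```rust
/// Sketch of the leader's commit rule. All names here are illustrative.
/// `log_terms[i]` is the term of the entry stored at 0-based position `i`;
/// `commit_index` and every `peer_match` value are entry counts.
fn advance_commit(
    current_term: u64,
    commit_index: usize,
    log_terms: &[u64],
    peer_match: &[usize], // match_index of each peer, leader excluded
) -> usize {
    let nodes = peer_match.len() + 1;
    // Scan N from log.len() down to commit_index + 1.
    for n in (commit_index + 1..=log_terms.len()).rev() {
        // Figure 8 restriction: only entries of the current term are
        // committed by counting replicas.
        if log_terms[n - 1] != current_term {
            continue;
        }
        // The leader always holds its own entry, hence the leading 1.
        let acks = 1 + peer_match.iter().filter(|&&m| m >= n).count();
        if acks > nodes / 2 {
            return n;
        }
    }
    commit_index
}
```

With a term-5 leader whose log holds terms `[3, 5]`, the old term-3 entry is never committed on its own even when every peer has it; it becomes committed only as a side effect of committing the term-5 entry above it.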

Acceptance

Inline unit tests in each language:

  • propose_single_node_commits — --nodes 1, propose 3 entries, every entry's term is the leader term, commit_index == 3.
  • append_entries_rejects_term_mismatch — leader with empty log sends AE with prev_idx=5, prev_term=1; follower returns success=false.
  • append_entries_truncates_conflict — follower with log=[(t=1), (t=1), (t=2)] receives AE with prev_idx=2, prev_term=1, entries=[ (t=3)]; resulting log is [(t=1), (t=1), (t=3)].
  • commit_requires_current_term — leader at term=5 replicates an old term=3 entry to all followers; commit_index does NOT advance past it until the leader appends a term=5 entry that also reaches majority.
  • quorum_commit_three_nodes — 3-node cluster, leader proposes one entry, one follower acks; commit_index advances (2 of 3 is a majority including the leader).

All five green in Rust, Go, and C++.
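
The two AppendEntries tests above exercise the follower's consistency check and truncation. A minimal sketch of just that part follows; it leaves out term handling, the election timer, and commit_index, and the names `handle_append` and `entry` are hypothetical helpers invented for this example.

```rust
#[derive(Clone, Debug, PartialEq)]
struct LogEntry {
    term: u64,
    command: Vec<u8>,
}

/// Helper for the examples: an entry with an empty command.
fn entry(term: u64) -> LogEntry {
    LogEntry { term, command: Vec::new() }
}

/// Follower-side consistency check and append. Returns (success, match_index).
/// `prev_idx` is the number of entries that precede `entries` in the leader's log.
fn handle_append(
    log: &mut Vec<LogEntry>,
    prev_idx: usize,
    prev_term: u64,
    entries: &[LogEntry],
) -> (bool, usize) {
    // Reject if the log is too short or the entry before the batch disagrees.
    if prev_idx > 0 && (log.len() < prev_idx || log[prev_idx - 1].term != prev_term) {
        return (false, 0);
    }
    for (i, incoming) in entries.iter().enumerate() {
        let at = prev_idx + i;
        if at < log.len() && log[at].term != incoming.term {
            log.truncate(at); // drop the conflicting suffix
        }
        if at >= log.len() {
            log.push(incoming.clone());
        }
        // Otherwise the entry is already present with the same term: keep it.
    }
    (true, prev_idx + entries.len())
}
```

Entries that already match are left untouched, so a retransmitted AppendEntries cannot shorten the follower's log.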

Discussion prompts

  • The Figure 8 commit restriction ("commit only entries of the current term") is famously subtle. Construct a 3-node scenario where omitting it lets a leader commit an entry that a future leader's election can erase.
  • Why does the leader update match_index[self] after propose? (Otherwise the majority check would always exclude the leader.)
  • What happens if two leaders coexist briefly (network partition that has not yet healed)? Specifically: which leader can advance commit_index, and why is this safe?

db-17 step 03 — Cross-test and partition

Goal

A Cluster that drives an n-node simulation forward by integer ticks, a --partition CLI flag that drops messages in named directions, and a cross-language scripts/cross_test.sh proving the canonical dump's sha256 is byte-identical across Rust, Go, and C++ for six seeded scenarios including partitions.

Tasks

  1. Cluster::new(seed, nodes). Holds:

    • nodes: Vec<RaftNode> (ids 0..nodes);
    • drop: BTreeSet<(u32, u32)> (directional message-drop set);
    • heap: BinaryHeap<InFlight> ordered by (delivery_time, sender, seq), where InFlight implements Ord such that BinaryHeap behaves as a min-heap;
    • seq: u64 (global monotonic);
    • pending_proposals: VecDeque<Vec<u8>>.
  2. Cluster::run(rounds, n_proposals). For each tick t in 0..rounds:

    1. Enqueue scheduled proposals. schedule[i] = (i+1) * rounds / (n_proposals + 1); if t == schedule[i], push payload "p<i:02>" onto pending_proposals.
    2. Inject pending into current leader. Find leader as the (max current_term, min id) node with role == Leader; while pending_proposals is non-empty and a leader exists, drain one payload and call leader.propose(payload). The propose pushes RPCs onto the heap with delivery times computed from splitmix64(seed ^ src ^ dst ^ t) % 3 + 1.
    3. Deliver. Pop every InFlight whose delivery_time == t. For each, if (sender, dest) is in drop, discard. Otherwise call nodes[dest].handle(rpc, t) and enqueue any reply RPCs the handler produces.
    4. Tick. Iterate nodes in ascending id; call node.on_tick(t) on each; enqueue any RPCs produced.
  3. canonical_dump(&cluster) -> Vec<u8>. As specified in CONCEPTS.md: magic "DSERAFT1" (8 bytes), u32_le(node_count), then for each node in id order: id, current_term, voted_for (i64 LE, -1 for None), role (u8), commit_index, log_len, and each entry's (term, cmd_len, cmd_bytes).

  4. raftctl CLI. Parses --seed, --nodes, --rounds, --proposals, --partition s,d,s,d,.... Calls Cluster::new, inserts every (s, d) pair into cluster.drop, runs, dumps, sha256s, prints lowercase hex with no trailing newline.

  5. scripts/cross_test.sh. For each of the six scenarios (A–F in docs/observation.md), invoke all three binaries with the same args, compare raw dumps with cmp -s, then compare hex hashes. Print the scenario label and OK on success, or the diverging offset and the three hashes on failure. End with === ALL OK ===.
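
The min-heap requirement from task 1 is worth seeing in code, because Rust's `BinaryHeap` is a max-heap by default. One way to get the required behaviour is to reverse the comparison in `Ord`. The sketch below assumes a simplified `InFlight` without the RPC payload; `key` and `delivery_order` are hypothetical helpers.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// Illustrative in-flight message; the real struct would also carry the RPC.
#[derive(Debug, PartialEq, Eq)]
struct InFlight {
    delivery_time: u64,
    sender: u32,
    seq: u64, // globally unique, so the key below is a total order
    dest: u32,
}

impl InFlight {
    fn key(&self) -> (u64, u32, u64) {
        (self.delivery_time, self.sender, self.seq)
    }
}

impl Ord for InFlight {
    fn cmp(&self, other: &Self) -> Ordering {
        // Reversed comparison: BinaryHeap pops its greatest element, so the
        // smallest key has to compare as the greatest.
        other.key().cmp(&self.key())
    }
}

impl PartialOrd for InFlight {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

/// Pops everything and returns (seq, dest) pairs in delivery order.
fn delivery_order(msgs: Vec<InFlight>) -> Vec<(u64, u32)> {
    let mut heap: BinaryHeap<InFlight> = msgs.into_iter().collect();
    let mut order = Vec::new();
    while let Some(m) = heap.pop() {
        order.push((m.seq, m.dest));
    }
    order
}
```

Wrapping the key in `std::cmp::Reverse` would work equally well; the explicit `Ord` impl is shown because it mirrors the wording of the task.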

Acceptance

  • cargo test --release ⇒ ~10 tests pass.
  • go test ./... ⇒ ~12 tests pass.
  • ctest --test-dir build ⇒ 100% tests passed.
  • ./scripts/verify.sh ⇒ === OK ===.
  • ./scripts/cross_test.sh ⇒ all six scenarios OK, final === ALL OK ===.
  • The exact sha256s match docs/observation.md's table. Specifically scenario A is a2299ff06a2ed5ced5842d100bb7867b3ae50f6e7d7da93f835385565f1ed9e9.

Discussion prompts

  • The proposal-injection step picks the leader by (max term, min id). Why not "first leader found in iteration order"? (Hint: Go's map iteration is randomized; (max term, min id) is content-defined.)
  • Scenario E (--partition 0,1,0,2,1,0,2,0) drops every message into or out of node 0. What is the only way the resulting log can contain committed entries? Trace which two-node sub-cluster achieves quorum.
  • Scenario F is an asymmetric partition (0 → 1 only). Why doesn't this cause permanent leadership churn? (Hint: node 1 can still reach node 0 via AppendEntriesReply.)
  • If you swap BTreeSet for HashSet in Cluster::drop (Rust), the hashes still match — why? But if you swap BTreeMap for HashMap in RaftNode::next_index, they don't. Articulate the rule.

db-18 — Paxos

This lab implements Multi-Paxos consensus in Rust, Go, and C++, all three producing a byte-identical sha256 of a canonical cluster dump for any (seed, nodes, rounds, proposals, partition) configuration. It is the sibling of db-17 (Raft) and reuses db-16's deterministic simulator discipline: same splitmix64 seeding, same (delivery_time, sender, seq) heap tie-break, same "sorted iteration on the wire" rule, same closed-form proposal schedule.

If db-17 taught you that one consensus algorithm can be expressed identically in three languages, db-18 teaches you that another consensus algorithm — built on different primitives, with no built-in leader concept, and capable of arbitrary concurrent proposers — can be held to the very same bit-level discipline. The two implementations share zero algorithmic code but share all of the determinism machinery, and that is the point.


What is it?

Paxos (Lamport, "The Part-Time Parliament" 1998 / "Paxos Made Simple" 2001) is the original asynchronous consensus algorithm: a family of acceptors collectively decides on a single value per slot despite crashes, message loss, and message reordering. Unlike Raft, Paxos has no first-class leader and no current_term. Its only ordering primitive is the ballot — a lexicographic pair (round, proposer_id) that acceptors monotonically promise to honor.

Single-decree (one-slot) Paxos has two phases:

  1. Phase 1 — Prepare / Promise. A proposer picks a fresh ballot b and broadcasts Prepare(b). An acceptor whose previously promised ballot is ≤ b updates promised := b and replies with every prior accept it holds (each (slot, accepted_ballot, value) triple). On collecting promises from a majority, the proposer enters Phase 2.

  2. Phase 2 — Accept / Accepted. For each slot, the proposer picks the value to propose: if any promise returned a prior accept for that slot, it must re-propose the value with the highest accepted_ballot (Lamport's P2c invariant); otherwise it is free to propose its own client value. It broadcasts Accept(b, slot, v). An acceptor whose promised ballot is ≤ b records accepted[slot] := (b, v) and replies Accepted(b, slot). On collecting accepts from a majority, the proposer declares the slot decided and broadcasts Decided(slot, v) to anyone who didn't get the accept.

Multi-Paxos amortizes Phase 1 across many slots. The proposer who "wins" Phase 1 acts as a distinguished proposer (lab-locally we call this role Leader) and reuses its promised ballot to drive Phase 2 for every subsequent slot, paying the Phase-1 cost only once per ballot. Liveness is preserved by election timeouts: an acceptor that hasn't heard from a leader for ≥ ELECTION_TIMEOUT_MIN + jitter ticks starts its own Phase 1 with a higher round.

This lab implements Multi-Paxos end-to-end. Paxos, in one variant or another, is the algorithm behind Google Chubby, Google Spanner's Paxos groups, Cassandra lightweight transactions, and (in spirit) Apache ZooKeeper's ZAB.


Why does it matter?

  • Paxos is the historical and theoretical root of asynchronous consensus. Raft, ZAB, Viewstamped Replication, and EPaxos are all reactions to or refinements of Paxos. Reading the paper is easier when you have made the algorithm bit-deterministic with your own hands.

  • No fixed leader means no "single term" to lean on. Raft's safety flows largely from "at most one leader per term". Paxos has neither a fixed leader nor a term. Its safety flows from the much weaker quorum-intersection argument: any two majorities of an n-node cluster share at least one acceptor, and that acceptor's promised-ballot ordering serializes every accept that could possibly decide a slot. Writing the algorithm in three languages, watching the same sha256 fall out, and then deliberately breaking the quorum (scenario E) is the most visceral way to internalise quorum intersection.

  • Concurrent proposers are first-class. Paxos lets every node attempt Phase 1 at any time. Dueling proposers are not an error case; they are the normal case during leadership churn. The deterministic simulator lets you replay the exact tick at which two proposers tied, see which ballot won, and confirm the safety invariants held without any "leader lease" magic.

  • Foundation for the rest of the distributed track. db-19 (ZAB) layers epoch+counter on top of a paxos-ish core; db-20 (distributed KV) feeds Paxos accept-decisions into a key-value state machine; db-23 (capstone) introduces snapshots and reconfiguration on top of whichever consensus engine the student picks (Raft, Paxos, or both).


How does it work?

State (per node)

acceptor    : promised_ballot : Ballot                # global, not per-slot
              accepts         : Map<slot, (Ballot, Vec<u8>)>
learner     : learned         : Map<slot, Vec<u8>>
proposer    : role            : Follower | Candidate | Leader
              my_ballot       : Ballot                # the ballot this node is driving
              prepare_promises: Set<acceptor_id>      # accumulated this election
              prepare_accepted: Map<slot, (Ballot, Vec<u8>)>  # recovered during Phase 1
              accept_count    : Map<slot, Set<acceptor_id>>
              next_slot       : u64                   # next fresh slot to propose
              pending         : Deque<Vec<u8>>        # queued client values
timers      : election_deadline   : u64               # sim-time tick
              last_heartbeat_sent : u64

promised_ballot is global per node (covers every slot, present and future) — this is the standard Multi-Paxos optimization. accepts is per-slot, because each slot is its own single-decree instance. learned is the per-slot decision; once set it never changes.

Ballot ordering

// Derived Ord compares fields in declaration order: round, then proposer_id.
#[derive(Clone, Copy, Eq, PartialEq, Ord, PartialOrd)]
struct Ballot { round: u32, proposer_id: u32 }

Lex order on (round, proposer_id). Ballot::ZERO = (0, 0) means "no ballot" and compares less than every other ballot. Promotion of promised_ballot is monotonic: once an acceptor has promised b, it will never accept any RPC carrying a strictly lower ballot.
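
As a small illustration of the ordering and the monotonic promise, the sketch below re-declares `Ballot` with a derived `Ord` so that it is self-contained. The `admit` helper is a hypothetical name for the comparison every acceptor handler performs; it is not taken from the lab's code.

```rust
/// Field order matters: the derived Ord compares `round` first, then
/// `proposer_id`, which gives the lexicographic order on (round, proposer_id).
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Ballot {
    round: u32,
    proposer_id: u32,
}

impl Ballot {
    /// "No ballot": less than or equal to every ballot.
    const ZERO: Ballot = Ballot { round: 0, proposer_id: 0 };
}

/// Acceptor-side gate (illustrative): a message is admitted only if its
/// ballot is not lower than the current promise. Admitting it raises the
/// promise, so the promise can never move backwards.
fn admit(promised: &mut Ballot, incoming: Ballot) -> bool {
    if incoming >= *promised {
        *promised = incoming;
        true
    } else {
        false
    }
}
```

A higher round always wins regardless of proposer id; the proposer id only breaks ties between ballots of the same round.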

Election timer (liveness)

reset_election_deadline(t):
    election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150

Identical to db-17's election timer. Heartbeats fire every 50 ticks from the current leader to keep follower timers refreshed.
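
The timer formula, and the delivery-delay formula `1 + splitmix64(seed ^ src ^ dst ^ t) % 3` that this lab shares with db-17, can be sketched as below. The mixing constants are the ones this lab lists; treating splitmix64 as a stateless function that adds the increment before mixing is an assumption about the lab's variant, and the function names are illustrative.

```rust
/// One splitmix64 step applied to `x` (assumed variant: add the increment,
/// then run the two multiply-xorshift rounds).
fn splitmix64(x: u64) -> u64 {
    let mut z = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E7B5);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Election deadline: 150 ticks plus a jitter in 0..150 that depends only on
/// (seed, node_id, t), so every language computes the same value.
fn election_deadline(seed: u64, node_id: u32, t: u64) -> u64 {
    t + 150 + splitmix64(seed ^ u64::from(node_id) ^ t) % 150
}

/// Message delay in ticks, always 1, 2, or 3.
fn delivery_delay(seed: u64, src: u32, dst: u32, t: u64) -> u64 {
    1 + splitmix64(seed ^ u64::from(src) ^ u64::from(dst) ^ t) % 3
}
```

Wrapping arithmetic is what keeps the three languages aligned: Go and C++ unsigned integers wrap silently, while Rust needs the explicit `wrapping_*` calls to avoid an overflow panic in debug builds.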

Phase 1 — Prepare / Promise

start_election(t):
    role = Candidate
    new_round = max(promised_ballot.round, my_ballot.round) + 1
    my_ballot = Ballot { round: new_round, proposer_id: self.id }
    prepare_promises = { self.id }                  # self-promise
    prepare_accepted = { slot: (ab, v) | (slot, (ab, v)) in self.accepts }
    if my_ballot >= promised_ballot:
        promised_ballot = my_ballot                 # we promise ourselves too
    broadcast(Prepare { ballot: my_ballot })
    if |prepare_promises| >= quorum():              # n = 1 cluster
        become_leader(t)

on Prepare(b) at acceptor:
    if b >= promised_ballot:
        promised_ballot = b
        if role in {Candidate, Leader} and b > my_ballot:
            step_down(t)                            # higher proposer takes over
        reset_election_deadline(t)
        send Promise(b, accept_ok=true,
                     accepted = sorted_by_slot(accepts),
                     acceptor_id = self.id) → b.proposer_id
    else:
        send Promise(b, accept_ok=false, acceptor_id=self.id) → b.proposer_id

on Promise(b, ok, accepted, from) at candidate:
    if role != Candidate or b != my_ballot: drop
    if not ok: step_down(t); return                 # someone outranks us
    prepare_promises.insert(from)
    for (slot, ab, v) in accepted:
        if slot not in prepare_accepted or ab > prepare_accepted[slot].ballot:
            prepare_accepted[slot] = (ab, v)        # recover highest-ballot value
    if |prepare_promises| >= quorum():
        become_leader(t)

The recovery rule ("take if ab > current.ballot") is the operational form of Lamport's P2c: across any majority of acceptors, the value with the highest accepted ballot for a slot is the only value that could already be decided in that slot, so the new leader must keep proposing it. If no acceptor in the majority reports a prior accept for the slot, the leader is free to propose any value.
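
A minimal sketch of that fold, assuming the recovered values live in a `BTreeMap` keyed by slot. `fold_promise` and `PriorAccept` are hypothetical names for this example.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Ballot {
    round: u32,
    proposer_id: u32,
}

/// One prior accept reported inside a Promise: (slot, accepted_ballot, value).
type PriorAccept = (u64, Ballot, Vec<u8>);

/// Fold one Promise's accepted list into the candidate's recovery map,
/// keeping for each slot the value with the strictly highest accepted ballot.
fn fold_promise(recovered: &mut BTreeMap<u64, (Ballot, Vec<u8>)>, accepted: &[PriorAccept]) {
    for (slot, ab, v) in accepted {
        let replace = match recovered.get(slot) {
            None => true,
            Some((current, _)) => ab > current,
        };
        if replace {
            recovered.insert(*slot, (*ab, v.clone()));
        }
    }
}
```

Because the rule compares ballots rather than arrival order, folding the same Promises in any order yields the same map, which is what both safety and the byte-identical dump depend on.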

Phase 2 — Accept / Accepted

become_leader(t):
    role = Leader
    # Re-issue Accepts under our ballot for every recovered slot.
    for slot in sorted(prepare_accepted.keys):
        if slot in learned: continue
        value = prepare_accepted[slot].value
        accepts[slot] = (my_ballot, value)
        accept_count[slot] = { self.id }
        broadcast(Accept { ballot: my_ballot, slot, value })
    next_slot = 1 + max(any seen slot in accepts ∪ learned, or -1)
    last_heartbeat_sent = t
    broadcast(Heartbeat { ballot: my_ballot })
    drain_pending()

drain_pending():
    while pending is non-empty:
        value = pending.pop_front()
        slot = next_slot; next_slot += 1
        accepts[slot] = (my_ballot, value)
        accept_count[slot] = { self.id }
        broadcast(Accept { ballot: my_ballot, slot, value })
        try_decide(slot)                            # n=1 cluster

on Accept(b, slot, v) at acceptor:
    if b >= promised_ballot:
        promised_ballot = b
        accepts[slot] = (b, v)
        if role in {Candidate, Leader} and b > my_ballot:
            step_down(t)
        reset_election_deadline(t)
        send Accepted(b, slot, ok=true, self.id) → b.proposer_id
    else:
        send Accepted(b, slot, ok=false, self.id) → b.proposer_id

on Accepted(b, slot, ok, from) at leader:
    if role != Leader or b != my_ballot: drop
    if not ok: step_down(t); return
    accept_count[slot].insert(from)
    try_decide(slot)

try_decide(slot):
    if role != Leader or slot in learned: return
    if |accept_count[slot]| >= quorum():
        v = accepts[slot].value
        learned[slot] = v
        broadcast(Decided { slot, value: v })

on Decided(slot, v) at any node:
    learned[slot] = v
    reset_election_deadline(t)

on Heartbeat(b) at node:
    if b >= my_ballot and role in {Candidate, Leader} and b.proposer_id != self.id:
        step_down(t)
    if b >= promised_ballot:                        # also covers b == promised_ballot
        reset_election_deadline(t)
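
The acceptor side of the handlers above reduces to one comparison per message. The sketch below isolates it: the `Acceptor` struct is a hypothetical cut-down of the per-node state, with proposer state, timers, and step-down logic left out, and replies are modelled as return values instead of RPCs.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug, Default, PartialEq, Eq, PartialOrd, Ord)]
struct Ballot {
    round: u32,
    proposer_id: u32,
}

/// Acceptor half of a node (illustrative). Default gives promised = (0, 0).
#[derive(Default)]
struct Acceptor {
    promised: Ballot,
    accepts: BTreeMap<u64, (Ballot, Vec<u8>)>,
}

impl Acceptor {
    /// Prepare: Some(prior accepts in ascending slot order) means accept_ok=true,
    /// None means accept_ok=false.
    fn on_prepare(&mut self, b: Ballot) -> Option<Vec<(u64, Ballot, Vec<u8>)>> {
        if b < self.promised {
            return None;
        }
        self.promised = b;
        // BTreeMap iterates in key order, so the payload is sorted by slot.
        Some(self.accepts.iter().map(|(s, (ab, v))| (*s, *ab, v.clone())).collect())
    }

    /// Accept: returns the accept_ok flag.
    fn on_accept(&mut self, b: Ballot, slot: u64, value: Vec<u8>) -> bool {
        if b < self.promised {
            return false;
        }
        self.promised = b;
        self.accepts.insert(slot, (b, value));
        true
    }
}

/// Strict majority, leader counted.
fn quorum(nodes: usize) -> usize {
    nodes / 2 + 1
}
```

Once the acceptor has promised a higher ballot, the old leader's Accept is refused, which is how a stale leader learns that it must step down.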

Simulator loop (per tick t in 0..rounds)

1. enqueue scheduled proposals  — schedule[i] = (i+1) * rounds / (K+1)
2. drain cluster-pending values into the current leader (if any)
3. pop every in-flight msg with delivery_time <= t and dispatch handle()
4. tick all nodes in ascending id; on_tick may fire election or heartbeat

The leader-pick rule for proposal injection is "lowest-id node with role == Leader". During leadership churn there may be no leader (in which case the value waits in cluster_pending) or even two stale leaders (in which case the lowest id wins). The deterministic choice is what keeps the byte hash stable.
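
Both rules are closed-form, which makes them easy to pin down in code. The sketch below assumes node roles are available as a slice indexed by node id; `schedule` and `pick_leader` are illustrative names.

```rust
/// Tick at which proposal `i` (0-based) of `k` is enqueued. Integer division
/// is part of the contract: all three languages must truncate the same way.
fn schedule(i: u64, rounds: u64, k: u64) -> u64 {
    (i + 1) * rounds / (k + 1)
}

#[derive(Clone, Copy, PartialEq, Eq)]
enum Role {
    Follower,
    Candidate,
    Leader,
}

/// Lowest-id node currently in role Leader, if any. `roles[id]` is the role
/// of node `id`, so scanning from the front finds the lowest id first.
fn pick_leader(roles: &[Role]) -> Option<usize> {
    roles.iter().position(|r| *r == Role::Leader)
}
```

For `--rounds 1000 --proposals 5` the schedule works out to ticks 166, 333, 500, 666, and 833.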

Wire format (Rpc)

Six variants: a tagged-union shape in Go, a Rust enum, and std::variant-backed types in C++. All fields are fixed-width and little-endian:

Prepare    { ballot: (round: u32, proposer_id: u32) }
Promise    { ballot, accept_ok: bool, acceptor_id: u32,
             accepted: [(slot: u64, accepted_ballot, value: Vec<u8>)] }
Accept     { ballot, slot: u64, value: Vec<u8> }
Accepted   { ballot, slot: u64, accept_ok: bool, acceptor_id: u32 }
Decided    { slot: u64, value: Vec<u8> }
Heartbeat  { ballot }

The wire format is not serialized to disk by this lab — the simulator passes Rpcs as typed structs in memory. The only thing that is serialized is the canonical dump, and that is what gets hashed.
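
In Rust the six variants map naturally onto an enum. The sketch below follows the field list above; the `ballot` accessor is a hypothetical convenience added for this example, not something the lab is known to define.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Ballot {
    round: u32,
    proposer_id: u32,
}

#[derive(Clone, Debug, PartialEq)]
enum Rpc {
    Prepare { ballot: Ballot },
    Promise {
        ballot: Ballot,
        accept_ok: bool,
        acceptor_id: u32,
        accepted: Vec<(u64, Ballot, Vec<u8>)>, // (slot, accepted_ballot, value)
    },
    Accept { ballot: Ballot, slot: u64, value: Vec<u8> },
    Accepted { ballot: Ballot, slot: u64, accept_ok: bool, acceptor_id: u32 },
    Decided { slot: u64, value: Vec<u8> },
    Heartbeat { ballot: Ballot },
}

impl Rpc {
    /// Ballot carried by the message, if any. Decided carries none because a
    /// decided value no longer depends on which ballot chose it.
    fn ballot(&self) -> Option<Ballot> {
        match self {
            Rpc::Prepare { ballot }
            | Rpc::Promise { ballot, .. }
            | Rpc::Accept { ballot, .. }
            | Rpc::Accepted { ballot, .. }
            | Rpc::Heartbeat { ballot } => Some(*ballot),
            Rpc::Decided { .. } => None,
        }
    }
}
```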

Canonical dump format

file := magic[8 = "DSEPAX01"] u32_le(node_count) node*

node := u32_le id
        u32_le promised_ballot.round
        u32_le promised_ballot.proposer_id
        u8     role                       # Follower=0, Candidate=1, Leader=2
        u32_le my_ballot.round
        u32_le my_ballot.proposer_id
        u32_le accept_count
        accept * accept_count
        u32_le learned_count
        learned * learned_count

accept  := u64_le slot
           u32_le accepted_ballot.round
           u32_le accepted_ballot.proposer_id
           u32_le value_len
           u8 value[value_len]

learned := u64_le slot
           u32_le value_len
           u8 value[value_len]

Nodes appear in ascending id order; inside each node, both accepts and learned are emitted in ascending slot order. All multi-byte integers are little-endian. The dump is hashed with SHA-256 and the lowercase hex digest is what paxosctl prints to stdout (no trailing newline).
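
The grammar above translates almost mechanically into an encoder. The following sketch assumes a hypothetical `NodeDump` snapshot struct whose `accepts` and `learned` vectors are already sorted by slot. It stops before hashing, since SHA-256 is not in the Rust standard library and the crate used by the lab is not stated here.

```rust
#[derive(Clone, Copy)]
struct Ballot {
    round: u32,
    proposer_id: u32,
}

/// Illustrative per-node snapshot; field names are assumptions.
struct NodeDump {
    id: u32,
    promised: Ballot,
    role: u8, // Follower=0, Candidate=1, Leader=2
    my_ballot: Ballot,
    accepts: Vec<(u64, Ballot, Vec<u8>)>, // must already be sorted by slot
    learned: Vec<(u64, Vec<u8>)>,         // must already be sorted by slot
}

fn put_ballot(out: &mut Vec<u8>, b: Ballot) {
    out.extend_from_slice(&b.round.to_le_bytes());
    out.extend_from_slice(&b.proposer_id.to_le_bytes());
}

/// Length-prefixed byte string: u32_le length, then the raw bytes.
fn put_bytes(out: &mut Vec<u8>, v: &[u8]) {
    out.extend_from_slice(&(v.len() as u32).to_le_bytes());
    out.extend_from_slice(v);
}

fn canonical_dump(nodes: &[NodeDump]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(b"DSEPAX01");
    out.extend_from_slice(&(nodes.len() as u32).to_le_bytes());
    for n in nodes {
        out.extend_from_slice(&n.id.to_le_bytes());
        put_ballot(&mut out, n.promised);
        out.push(n.role);
        put_ballot(&mut out, n.my_ballot);
        out.extend_from_slice(&(n.accepts.len() as u32).to_le_bytes());
        for (slot, ab, v) in &n.accepts {
            out.extend_from_slice(&slot.to_le_bytes());
            put_ballot(&mut out, *ab);
            put_bytes(&mut out, v);
        }
        out.extend_from_slice(&(n.learned.len() as u32).to_le_bytes());
        for (slot, v) in &n.learned {
            out.extend_from_slice(&slot.to_le_bytes());
            put_bytes(&mut out, v);
        }
    }
    out
}
```

Under this layout the role byte of node 0 lands at file offset 24: 8 bytes of magic, 4 of node count, 4 of id, and 8 of promised_ballot precede it.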

Cross-language invariants

| Invariant | Why it matters |
|---|---|
| splitmix64 constants 0x9E3779B97F4A7C15, 0xBF58476D1CE4E7B5, 0x94D049BB133111EB | identical PRNG output across languages |
| election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150 | identical election firing times |
| delivery_delay = 1 + splitmix64(seed ^ src ^ dst ^ t) % 3 | identical message scheduling |
| heap order (delivery_time, sender, seq); seq global monotonic | identical delivery sequence |
| peers iterated in ascending id (BTreeMap / std::map / explicit for p:=0;p<n;p++) | identical broadcast order |
| acceptor's Promise lists prior accepts in ascending slot order | identical Promise payload bytes |
| candidate's Phase-1 recovery rule: keep (ab, v) with strictly greater ab | identical recovered value per slot |
| next_slot = 1 + max(seen accept slot ∪ seen learned slot) after winning Phase 1 | identical first fresh slot |
| try_decide quorum check uses ≥ n/2 + 1 (strict majority, leader counted) | identical decide tick |
| leader-pick for proposal injection: lowest-id Leader | identical client routing |
| proposal schedule: schedule[i] = (i+1) * rounds / (K+1) integer division | identical pending queue contents |
| Role enum order Follower=0, Candidate=1, Leader=2 | identical dump bytes |
| dump emits accepts and learned in ascending slot order; nodes in ascending id order | identical dump bytes |

Drift in any one of these and scripts/cross_test.sh fails. The companion cmp -l workflow in docs/observation.md walks you from "the hashes differ" to "this exact byte differs" in three commands.
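
Several of these invariants come down to the same habit: never let a hash map's iteration order reach the wire or the dump. A minimal sketch, assuming an implementation that stores accepts in a `HashMap` and therefore has to sort explicitly (the function name is hypothetical):

```rust
use std::collections::HashMap;

/// Turn an unordered map of accepts into a payload sorted by slot, so the
/// output does not depend on HashMap's per-process iteration order.
fn sorted_by_slot(accepts: &HashMap<u64, Vec<u8>>) -> Vec<(u64, Vec<u8>)> {
    let mut out: Vec<(u64, Vec<u8>)> =
        accepts.iter().map(|(slot, v)| (*slot, v.clone())).collect();
    out.sort_by_key(|(slot, _)| *slot);
    out
}
```

Using `BTreeMap` in Rust or `std::map` in C++ gives the sorted order for free; Go has no ordered map in its standard library, so the explicit sort shown here is the shape the Go code has to take.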

Multi-Paxos vs. Raft (the comparison the labs exist to make)

| Dimension | Raft (db-17) | Multi-Paxos (db-18) |
|---|---|---|
| ordering primitive | current_term: u64 (single integer, persisted, monotonic) | Ballot { round, proposer_id } lex pair |
| leader concept | first-class; at most one leader per term | emergent; "leader" = whoever last won Phase 1 |
| concurrent proposers | forbidden by election safety | allowed (and routine during churn) |
| consistency check | prev_log_index / prev_log_term per AppendEntries | per-slot accepted_ballot carried in Promise |
| Phase-1 cost amortization | none needed (single leader) | Multi-Paxos (one Prepare covers all future slots) |
| safety from | log matching + election restriction + commit-only-current-term | quorum intersection + Promise reports prior accepts |
| understandability | designed for clarity (Ongaro 2014) | famously subtle (P2c, dueling proposers) |

The lab implementations make these dimensions concrete: scenario A in db-17 takes ~166 ticks to commit a proposal (election + AE round trip); the equivalent scenario A here takes ~150 ticks for Phase 1 plus ~3 ticks per Accept, then the leader runs at Phase-2-only cost until somebody bumps it.


Files

  • src/rust/ — paxos18 crate + paxosctl binary.
  • src/go/ — module github.com/10xdev/dse/db18 + cmd/paxosctl.
  • src/cpp/ — db18_lib static library + paxosctl binary + test_db18.
  • scripts/verify.sh — builds + runs the unit tests for all three.
  • scripts/cross_test.sh — proves the three binaries produce byte-identical canonical dumps for six seeded scenarios.

See docs/ for the long-form write-up and steps/ for the staged implementation path.

db-18 — References

Primary sources

  • Leslie Lamport, The Part-Time Parliament, ACM TOCS 1998. The original Paxos paper. Famously hard to read (the Parliament of Paxos allegory hides the algorithm). The mathematics in §2 is the spec; the rest is narrative. https://lamport.azurewebsites.net/pubs/lamport-paxos.pdf
  • Leslie Lamport, Paxos Made Simple, ACM SIGACT News 2001. The paper to read first. The whole algorithm — single-decree and the Multi-Paxos extension — is on four pages. The P1a / P2a / P2b / P2c conditions in this paper, together with its Phase 1(b) promise rule, are the ones whose operational forms the simulator enforces. https://lamport.azurewebsites.net/pubs/paxos-simple.pdf
  • Tushar Chandra, Robert Griesemer, Joshua Redstone, Paxos Made Live — An Engineering Perspective, PODC 2007. Google's Chubby team's writeup of what it took to turn the algorithm into a production system: leader leases, snapshots, group membership, disk corruption, the works. This lab implements roughly §2–§3 of that paper. https://research.google/pubs/paxos-made-live-an-engineering-perspective/
  • Robbert van Renesse & Deniz Altinbuken, Paxos Made Moderately Complex, ACM CSUR 2015. The most readable end-to-end derivation of Multi-Paxos. Pseudocode in §3 maps almost line-for-line onto this lab's start_election, become_leader, try_decide. https://www.cs.cornell.edu/courses/cs7412/2011sp/paxos.pdf
  • Heidi Howard, Distributed Consensus Revised, PhD dissertation, Cambridge 2019 (also A Generalised Solution to Distributed Consensus, 2020). Reframes Paxos as one point in a design space parameterised by quorum-intersection requirements; explains why Flexible Paxos works and how Raft, EPaxos, and Vertical Paxos all fit into the same picture. https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-935.pdf

Variants worth knowing

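  • Leslie Lamport, Fast Paxos, Distributed Computing 2006. Lets clients send values directly to acceptors, saving one message delay when proposals do not collide, at the cost of larger fast-path quorums (3f+1 acceptors to tolerate f failures when fast and classic quorums are the same size). EPaxos generalises this.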
  • Iulian Moraru, David Andersen, Michael Kaminsky, There Is More Consensus in Egalitarian Parliaments (EPaxos), SOSP 2013. Drops the leader entirely; each command picks its own dependency graph. Production-relevant in geo-distributed systems where any-leader latency is uneven.
  • Lamport, Malkhi, Zhou, Vertical Paxos, PODC 2009. Decouples reconfiguration from the consensus protocol — the answer to "how do you change the acceptor set without stopping the world".
  • Lamport, Generalized Paxos, MSR-TR-2005-33. Lets commutative commands be ordered concurrently; precursor to EPaxos.

Reference implementations to read alongside

Background reading worth doing

Cross-lab dependencies

  • Upstream:
    • db-16 distributed-fundamentals (Lamport clocks, vector clocks, and the deterministic simulator harness whose discipline this lab inherits wholesale).
    • db-17 raft (sibling consensus algorithm; same harness, same canonical-dump discipline, different RPCs and safety arguments).
  • Downstream:
    • db-19 ZAB — leader-based atomic broadcast; the zxid = (epoch, counter) pair generalises this lab's Ballot.
    • db-20 Distributed KV — wraps a chosen consensus engine around a key-value state machine. Paxos and Raft are interchangeable plug-ins at that layer.
    • db-23 Capstone — adds snapshots, reconfiguration (Vertical Paxos or joint consensus), and multi-shard deployment.

db-18 — Analysis

Required invariants

If any of these is violated, scripts/cross_test.sh will fail, and in the worst case the algorithm itself is unsafe. They are listed in the order in which they are easiest to reason about.

  1. Promise monotonicity (P1). For every acceptor and every sim-time tick t, promised_ballot[t] >= promised_ballot[t-1]. The simulator enforces this with a single comparison in both the Prepare and Accept handlers: the message's ballot must be >= promised_ballot before any state mutation. The Promise reply's accept_ok bit reports the outcome of that comparison to the proposer; it is the operational form of the Phase 1(b) response in Paxos Made Simple.

  2. Accept respects promise (P2a). No acceptor ever stores accepts[slot] = (ab, v) with ab < promised_ballot. The Accept handler short-circuits with accept_ok=false when b < promised_ballot; the leader interprets that bit and steps down instead of advancing accept_count.

  3. Per-slot accept uniqueness under a ballot (P2b). For a fixed slot s and a fixed ballot b, the value v that any acceptor stores under (s, b) is the same value. This holds trivially here because only the leader of ballot b ever sends Accept(b, s, v), and its accepts[s] is set once and never overwritten under its own ballot.

  4. P2c (the safety lemma that needs work). Suppose value v is chosen at slot s under ballot b. Then for any ballot b' > b issued by any proposer, the value field of any Accept(b', s, v') will satisfy v' == v. The mechanism: to issue an Accept at all, the proposer must have collected promises from a quorum at ballot b'. That quorum intersects with the quorum that chose v at b. The intersecting acceptor saw v accepted under b, so its Promise carries (s, b, v). The proposer's recovery rule (take the value whose accepted_ballot is highest) therefore takes v (or a later value chosen under some b'' > b, but inductively that value is also v). So v' == v. QED. The simulator implements this rule in start_election's init of prepare_accepted and in the Promise-handler's if ab > prepare_accepted[s].ballot update.

  5. Decided-once / monotonic learn. Once learned[s] is set on any node, it never changes value. Locally enforced by reading before writing; globally guaranteed by P2c.

  6. Byte-determinism of the dump. Two runs with the same (seed, nodes, rounds, proposals, partition) produce identical canonical dump bytes on every language. This requires every iteration order (peers, slots, accepted-list inside Promise, heap pops on identical (time, sender, seq)) to be fixed. Drift here is what cross_test.sh catches.
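
The P2c argument in invariant 4 rests entirely on quorum intersection. For the cluster sizes this lab allows (up to 8 nodes) the property can be checked by brute force. The sketch below is illustrative only; `all_pairs_intersect` is a hypothetical helper that represents node subsets as bitmasks.

```rust
/// Strict majority of an n-node cluster.
fn quorum(n: u32) -> u32 {
    n / 2 + 1
}

/// True if every pair of node subsets with at least `size` members shares a
/// node. Subsets of nodes 0..n are encoded as bitmasks, so n must be < 32.
fn all_pairs_intersect(n: u32, size: u32) -> bool {
    let sets: Vec<u32> = (0u32..(1u32 << n))
        .filter(|mask| mask.count_ones() >= size)
        .collect();
    sets.iter().all(|a| sets.iter().all(|b| (a & b) != 0))
}
```

For `n = 4`, subsets of size 2 fail the check (nodes {0, 1} and {2, 3} are disjoint), while subsets of size `quorum(4) = 3` pass. That gap is why the decide check uses n/2 + 1 and not n/2.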

Design decisions worth highlighting

  • promised_ballot is global per node, not per-slot. This is the Multi-Paxos optimization. A per-slot promised-ballot map would be more general (closer to single-decree Paxos per slot) but would cost a Phase-1 per slot. The global ballot lets one Phase 1 cover every present and future slot.

  • Phase-1 recovery walks every prior accept, not just the latest per slot. The Promise reply contains all of the acceptor's accepts (sorted by slot). The candidate folds them into prepare_accepted with the rule "take if accepted_ballot is strictly greater". Per the proof of P2c this is the only correct rule; a "latest by receive order" tie-break would lose safety the moment Promises arrived out of order.

  • my_ballot.round is bumped to max(promised, my_ballot).round + 1 when starting an election, not just promised.round + 1. If this node previously won a higher ballot and stepped down due to a partition heal, it would otherwise re-issue its old ballot and immediately lose to its own historical promise. The max makes forward progress under churn deterministic.

  • Leader-pick rule: lowest-id Leader. When the simulator must inject a client proposal, it picks the lowest-id node currently in role Leader. There may be zero (queue the proposal in cluster_pending) or, briefly, two stale Leaders (the lower id wins; the other's Accept will fail at acceptors that have already promised the new ballot). Determinism > realism here.

  • drain_pending runs on every Accepted, not just every tick. In single-node mode (--nodes 1) the leader becomes its own quorum and decides slots inside the broadcast loop. Doing the drain in become_leader and in try_decide means scenario D's hash is independent of how the simulator orders ticks.

  • Heap key (delivery_time, sender, seq). db-16's invariant. Without the seq tiebreak, Promise messages from two acceptors arriving on the same tick from the same sender (impossible by construction, but the type system doesn't know that) would be reorder-able across languages.

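  • Role enum order. Follower=0, Candidate=1, Leader=2 was chosen to match db-17; any change would propagate into the role byte of every node record (for node 0 that is file offset 12 + 12 = 24: 8-byte magic and 4-byte node count, then 4-byte id and 8-byte promised_ballot), which would silently invalidate scenario A's canonical hash.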

Tradeoffs worth flagging

  • Concurrent proposers cost throughput, not safety. Two proposers in dueling Phase 1 can ping-pong each other forever in principle. The lab dodges this in two ways: (a) the deterministic simulator can't sustain a livelock because election timeouts are PRNG-jittered per node-id, and (b) once a leader is elected, the election-timer reset on Heartbeat keeps it elected. Production systems add leader leases (Chubby, Spanner) to push the worst case down further.

  • No commit-only-current-term subtlety. Raft has Figure 8: a newly-elected leader must commit something in its own term before it can ack older entries, otherwise they can be silently overwritten. Paxos sidesteps the problem because P2c forces a new leader to re-Accept any recoverable value under its own ballot; there is no "shadow commit" to retract. The price is the Phase-1-on-every-startup cost.

  • No native log compaction. This lab's accepts and learned grow unboundedly. A real Multi-Paxos system snapshots a state machine and discards accepts below the snapshot index (see Spanner, Chubby, db-23). Adding snapshots here would require exposing a committed_through watermark in the dump.

  • No membership change. n is fixed at Cluster construction time. Vertical Paxos (Lamport/Malkhi/Zhou 2009) is the textbook way to add this. db-23 covers it.

  • Three languages is more work than two. Two languages prove the spec is unambiguous. Three rules out the case where you and your collaborator have committed the same misreading. C++'s std::map and Rust's BTreeMap agreeing with Go's explicit sort.Slice was the only thing that caught a misordered Promise payload in scenario B during development.

Why three languages

Same answer as db-17: the constraint forces the spec to be a spec and not a habit. Sorted-iteration discipline, fixed enum order, little-endian fixed-width integers, and no map iteration on the wire are all rules that are easy to break unnoticed in any single language, and the only way to surface a violation is to ask "would another implementation make the same choice without being told?". For Paxos the question matters even more: the algorithm is sensitive to whether the highest-ballot prior accept is chosen during recovery, and a sort-order bug would make safety stochastic, which is the worst possible failure mode.

db-18 — Execution

One-shot: prove it works

cd db-18-paxos
bash scripts/verify.sh        # all three languages' unit tests
bash scripts/cross_test.sh    # 6 scenarios × 3 binaries × byte-identical hash

verify.sh must end with === verify OK ===. cross_test.sh must end with === ALL OK === and the six per-scenario hashes must match the table in docs/observation.md.

Per-language workflows

Rust

cd src/rust
cargo build --release         # builds paxos18 lib + paxosctl bin
cargo test --release          # 12 unit tests (see verification.md)
./target/release/paxosctl --seed 42 --nodes 3 --rounds 1000 --proposals 5

Crate layout:

  • src/lib.rs — paxos18 library: ballot, RPCs, PaxosNode, Cluster, canonical dump, sha256 helper, inline #[cfg(test)] module.
  • src/bin/paxosctl.rs — CLI entry: parses flags, runs the cluster, emits the sha256 hex digest on a single line with no newline.

Go

cd src/go
go build ./...                # compiles package + cmd/paxosctl (writes no binary)
go build -o paxosctl_bin ./cmd/paxosctl   # writes the CLI binary used below
go test ./...                 # 11 unit tests
./paxosctl_bin --seed 42 --nodes 3 --rounds 1000 --proposals 5

Module layout:

  • paxos.go — package db18: same surface as the Rust crate.
  • paxos_test.go: go test suite.
  • cmd/paxosctl/main.go — CLI binary.
  • go.mod — module github.com/10xdev/dse/db18, go 1.22.

C++

cd src/cpp
mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j
./test_db18                   # 11 unit tests
./paxosctl --seed 42 --nodes 3 --rounds 1000 --proposals 5

Source layout:

  • include/db18/paxos.hpp + src/paxos.cpp — the db18 namespace library.
  • src/paxosctl_main.cpp — CLI entry.
  • tests/test_db18.cpp — gtest-style assertions (no framework dependency; pure asserts + main).
  • CMakeLists.txt — exposes db18_lib, paxosctl, test_db18.

CLI reference

paxosctl has the same flags in all three languages. Anything else on the command line is rejected.

| Flag | Type | Default | Meaning |
|---|---|---|---|
| --seed | u64 | required | Seeds splitmix64 for the cluster, every node's election jitter, and every message's delivery delay. |
| --nodes | u32 (1..=8) | required | Number of acceptor/proposer nodes; quorum = nodes/2 + 1. |
| --rounds | u64 | required | Number of sim-time ticks to run. |
| --proposals | u32 | required | Number of client values to inject. Value i is "val-{i}", scheduled at tick (i+1)*rounds/(proposals+1). |
| --partition | comma list of src,dst pairs (even-length) | (none) | Drop every message with the listed (src, dst) ordered pairs. Asymmetric: 0,1 blocks 0→1 but not 1→0. Pass 0,1,1,0 for a symmetric link cut. |

Output: a single line of lowercase hex (64 chars), no trailing newline. Exit code 0 on success; non-zero with a stderr message on parse error.
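The two least obvious rules in the flag table are the ordered-pair semantics of --partition and the integer-division proposal schedule. A minimal Rust sketch of both follows; the function names (parse_partition, proposal_tick, is_dropped) are hypothetical and not taken from the lab's source.

```rust
use std::collections::BTreeSet;

/// Parse a `--partition` argument such as "0,1,1,0" into ordered (src, dst) pairs.
/// Errors on non-numeric entries or an odd number of ids.
fn parse_partition(arg: &str) -> Result<BTreeSet<(u32, u32)>, String> {
    let ids: Vec<u32> = arg
        .split(',')
        .map(|s| s.trim().parse::<u32>().map_err(|e| format!("bad node id {s:?}: {e}")))
        .collect::<Result<_, _>>()?;
    if ids.len() % 2 != 0 {
        return Err(format!("partition list must have even length, got {}", ids.len()));
    }
    Ok(ids.chunks(2).map(|p| (p[0], p[1])).collect())
}

/// Tick at which client value `i` is injected: (i+1)*rounds/(proposals+1),
/// using integer division as in the flag table.
fn proposal_tick(i: u64, rounds: u64, proposals: u64) -> u64 {
    (i + 1) * rounds / (proposals + 1)
}

/// True when a message on the ordered edge src -> dst should be dropped.
fn is_dropped(drops: &BTreeSet<(u32, u32)>, src: u32, dst: u32) -> bool {
    drops.contains(&(src, dst))
}
```

With scenario A's flags (rounds = 1000, proposals = 5) the schedule works out to ticks 166, 333, 500, 666, and 833, and parsing "0,1" yields a set that drops 0→1 while leaving 1→0 intact.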

Sample invocations

# Single-node "consensus" — leader is itself, every proposal decides instantly.
paxosctl --seed 1 --nodes 1 --rounds 200 --proposals 5

# Three-node happy path.
paxosctl --seed 42 --nodes 3 --rounds 1000 --proposals 5

# Symmetric partition between 0 and 1 plus 0 and 2 — node 0 is isolated.
paxosctl --seed 42 --nodes 3 --rounds 1000 --proposals 3 \
  --partition 0,1,0,2,1,0,2,0

Canonical scenarios

These are the six configurations that cross_test.sh runs. Each combination is a known-stable byte fingerprint; if any of them changes, you have changed semantics and should expect the cross-test to fail until you understand why.

| Name | Flags | Notes |
|---|---|---|
| A | --seed 42 --nodes 3 --rounds 1000 --proposals 5 | happy path, 3-node, no partition |
| B | --seed 7 --nodes 5 --rounds 2000 --proposals 20 | 5-node, longer schedule, more decisions |
| C | --seed 99 --nodes 3 --rounds 500 --proposals 0 | leader election only; no proposals |
| D | --seed 1 --nodes 1 --rounds 200 --proposals 5 | single-node; quorum = self |
| E | --seed 42 --nodes 3 --rounds 1000 --proposals 3 --partition 0,1,0,2,1,0,2,0 | node 0 isolated symmetrically; {1,2} retain quorum |
| F | --seed 3 --nodes 5 --rounds 1500 --proposals 10 --partition 0,1 | asymmetric link cut; minor degradation |

Sanity checks

If you only have ten seconds:

( cd src/rust && cargo build --release ) >/dev/null && \
( cd src/go && go build -o paxosctl_bin ./cmd/paxosctl ) >/dev/null && \
( cd src/cpp/build && cmake --build . --target paxosctl ) >/dev/null && \
diff <(src/rust/target/release/paxosctl --seed 42 --nodes 3 --rounds 1000 --proposals 5) \
     <(src/go/paxosctl_bin            --seed 42 --nodes 3 --rounds 1000 --proposals 5) && \
diff <(src/rust/target/release/paxosctl --seed 42 --nodes 3 --rounds 1000 --proposals 5) \
     <(src/cpp/build/paxosctl         --seed 42 --nodes 3 --rounds 1000 --proposals 5) && \
echo OK

Silence + OK = green. Any diff = divergence; jump to docs/observation.md § Divergence runbook.

db-18 — Observation

Expected canonical hashes

Six configurations are pinned in scripts/cross_test.sh. The lab is green iff all three binaries (Rust release, Go release, C++ Release) emit exactly these strings on stdout (no trailing newline):

| Name | Flags | SHA-256 of canonical dump |
|---|---|---|
| A | --seed 42 --nodes 3 --rounds 1000 --proposals 5 | 0a35fdad1dd97c76a40a61b020c6181a56c4a40d4f723cb68fe70c2112aa9b63 |
| B | --seed 7 --nodes 5 --rounds 2000 --proposals 20 | 3cc6cae6cb7f9d2b7cb88088a0f22581ac4c41bd86bab1b3676dd0ba33fd7ead |
| C | --seed 99 --nodes 3 --rounds 500 --proposals 0 | f28d025af748a790beded6167115c7094a7f939b45d439728e4d6b7e144c3be0 |
| D | --seed 1 --nodes 1 --rounds 200 --proposals 5 | e5e0248c7c4fa20991b90afdac828eab91a7414497461dadc2e1553040693139 |
| E | --seed 42 --nodes 3 --rounds 1000 --proposals 3 --partition 0,1,0,2,1,0,2,0 | 674e62d809248ac99401054c195d29b0e2eed6ccc78ec45e96da8aaf69c36096 |
| F | --seed 3 --nodes 5 --rounds 1500 --proposals 10 --partition 0,1 | 7d80176abad54e533b2f4174e84f58432a000255fbb2ecbbb1dd915cb6bb6ab5 |

These are the contract. Edit any production code such that one of these strings changes and you have changed semantics; reverify end-to-end before you ship.

Walking the wire: scenario D byte-by-byte

Scenario D is the shortest possible dump (one node, five proposals, all decided locally). Use it as a Rosetta Stone before debugging the multi-node hashes. The layout is magic || u32 node_count || node[], and the node payload starts at offset 12.

00..07  4453 4550 4158 3031     "DSEPAX01"            magic
08..0b  01 00 00 00             node_count = 1
0c..0f  00 00 00 00             node.id = 0
10..13  rr rr 00 00             node.promised_ballot.round       (round it won at)
14..17  00 00 00 00             node.promised_ballot.proposer_id (= self.id = 0)
18      02                      node.role = Leader (2)
19..1c  rr rr 00 00             node.my_ballot.round
1d..20  00 00 00 00             node.my_ballot.proposer_id
21..24  05 00 00 00             accept_count = 5
... 5 × {u64 slot, u32 ab.round, u32 ab.proposer_id, u32 value_len, value bytes}
... then u32 learned_count = 5 and 5 × {u64 slot, u32 value_len, value bytes}

Run:

src/rust/target/release/paxosctl --seed 1 --nodes 1 --rounds 200 --proposals 5
# e5e0248c7c4fa20991b90afdac828eab91a7414497461dadc2e1553040693139

To dump the raw bytes (skip the sha256 step) hack the binary to print canonical_dump instead of sha256_hex(&canonical_dump); do it locally only — the canonical CLI output is the sha256.
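The layout walked through above can be assembled with a small writer. The following is a sketch under assumptions: the Node struct and helper names are hypothetical stand-ins for whatever the lab's PaxosNode exposes, ballots are modelled as (round, proposer_id) tuples, and the SHA-256 step is left out because the Rust standard library has no hash for it.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the fields of a node that reach the dump.
struct Node {
    id: u32,
    promised: (u32, u32), // (round, proposer_id)
    role: u8,             // Follower=0, Candidate=1, Leader=2
    my_ballot: (u32, u32),
    accepts: BTreeMap<u64, ((u32, u32), Vec<u8>)>,
    learned: BTreeMap<u64, Vec<u8>>,
}

fn put_u32(out: &mut Vec<u8>, v: u32) { out.extend_from_slice(&v.to_le_bytes()); }
fn put_u64(out: &mut Vec<u8>, v: u64) { out.extend_from_slice(&v.to_le_bytes()); }

/// Length-prefixed byte string: u32_le(len) followed by the bytes.
fn put_value(out: &mut Vec<u8>, v: &[u8]) {
    put_u32(out, v.len() as u32);
    out.extend_from_slice(v);
}

/// `nodes` is keyed by id, so BTreeMap iteration is already ascending by id.
fn canonical_dump(nodes: &BTreeMap<u32, Node>) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(b"DSEPAX01");
    put_u32(&mut out, nodes.len() as u32);
    for node in nodes.values() {
        put_u32(&mut out, node.id);
        put_u32(&mut out, node.promised.0);
        put_u32(&mut out, node.promised.1);
        out.push(node.role);
        put_u32(&mut out, node.my_ballot.0);
        put_u32(&mut out, node.my_ballot.1);
        put_u32(&mut out, node.accepts.len() as u32);
        for (slot, (ballot, value)) in &node.accepts {
            put_u64(&mut out, *slot);
            put_u32(&mut out, ballot.0);
            put_u32(&mut out, ballot.1);
            put_value(&mut out, value);
        }
        put_u32(&mut out, node.learned.len() as u32);
        for (slot, value) in &node.learned {
            put_u64(&mut out, *slot);
            put_value(&mut out, value);
        }
    }
    out
}
```

For a single node holding one accept and one learned entry with the 5-byte value "val-0", the writer emits 83 bytes, with the role byte at offset 0x18 and accept_count at 0x21..0x24, matching the offsets in the walk-through.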

Walking the wire: scenario C (no proposals)

Scenario C runs three nodes for 500 ticks with --proposals 0. Exactly one of them will be elected leader; nobody decides anything. The dump therefore has accept_count == 0 and learned_count == 0 for every node. The bytes that do change between languages if you have an iteration-order bug are the per-node promised_ballot.round values (the elected leader's round depends on whether some other proposer almost-elected first). If C is the failing scenario, you have an election-timer determinism bug, not a Phase-2 bug.

Divergence runbook

If cross_test.sh prints MISMATCH scenario X, follow this script:

# 1. Capture the raw bytes from each binary. Patch paxosctl locally
#    to print `canonical_dump` raw instead of sha256 hex, run once,
#    then revert the patch. Save to rust.bin, go.bin, cpp.bin.

cmp -l rust.bin go.bin | head
cmp -l rust.bin cpp.bin | head

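cmp -l prints one line per differing byte: the byte offset in decimal (1-based), then the two differing byte values in octal (here rust_value, then go_value). Subtract 1 from the first reported offset to get the 0-based offsets used below, then map it to the field it belongs to: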

| Offset | Field | Likely culprit |
|---|---|---|
| 0..7 | magic "DSEPAX01" | wrong magic literal |
| 8..11 | node_count | wrong u32_le writer, wrong endianness |
| 12 + k*node_size + 0..3 | node.id | iterating nodes in wrong order (not ascending id) |
| 12 + k*node_size + 4..11 | promised_ballot | election-timer drift or wrong PRNG seed mix |
| 12 + k*node_size + 12 | role (1 byte) | enum reordered (must be Follower=0, Candidate=1, Leader=2) |
| 12 + k*node_size + 13..20 | my_ballot | step-down logic differs (e.g., resetting my_ballot to zero or not) |
| 12 + k*node_size + 21..24 | accept_count | one acceptor accepted a slot the others did not; Phase-2 message ordering bug |
| inside an accept tuple | slot | accepts iterated in receive order, not sorted by slot |
| inside an accept tuple | accepted_ballot | Phase-1 recovery used a wrong rule (e.g., last-write-wins instead of highest-ballot) |
| inside an accept tuple | value_len / value | wrong proposal scheduled at this slot; proposal-injection rule or leader-pick rule differs |
| inside the learned section | slot / value | the difference is downstream of an accept-section difference; fix that first |
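Because every node record is variable-length (it embeds its accepts and learned entries), node_size in the table is not a constant, and mapping an offset by hand gets tedious past the first node. One way to automate the lookup is to walk the dump and report which field contains a given offset. This is a sketch, not part of the lab's tooling: Walker and field_at are hypothetical names, and the layout is the one described in the scenario D walk-through.

```rust
/// Cursor over a dump that remembers which labelled span contains `target`.
struct Walker<'a> {
    dump: &'a [u8],
    off: usize,
    target: usize,
    hit: Option<String>,
}

impl<'a> Walker<'a> {
    /// Consume `len` bytes labelled `label`; None if the dump is too short.
    fn span(&mut self, len: usize, label: String) -> Option<&'a [u8]> {
        let end = self.off.checked_add(len)?;
        let bytes = self.dump.get(self.off..end)?;
        if self.hit.is_none() && (self.off..end).contains(&self.target) {
            self.hit = Some(label);
        }
        self.off = end;
        Some(bytes)
    }

    /// Consume a little-endian u32 and return its value.
    fn u32(&mut self, label: String) -> Option<u32> {
        let b = self.span(4, label)?;
        Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
    }
}

/// Name the field containing the 0-based byte offset `target`.
/// Returns None for a truncated dump or an offset past the end.
fn field_at(dump: &[u8], target: usize) -> Option<String> {
    let mut w = Walker { dump, off: 0, target, hit: None };
    w.span(8, "magic".to_string())?;
    let nodes = w.u32("node_count".to_string())?;
    for k in 0..nodes {
        w.span(4, format!("node[{k}].id"))?;
        w.span(8, format!("node[{k}].promised_ballot"))?;
        w.span(1, format!("node[{k}].role"))?;
        w.span(8, format!("node[{k}].my_ballot"))?;
        let accepts = w.u32(format!("node[{k}].accept_count"))?;
        for i in 0..accepts {
            w.span(8, format!("node[{k}].accept[{i}].slot"))?;
            w.span(8, format!("node[{k}].accept[{i}].accepted_ballot"))?;
            let len = w.u32(format!("node[{k}].accept[{i}].value_len"))? as usize;
            w.span(len, format!("node[{k}].accept[{i}].value"))?;
        }
        let learned = w.u32(format!("node[{k}].learned_count"))?;
        for i in 0..learned {
            w.span(8, format!("node[{k}].learned[{i}].slot"))?;
            let len = w.u32(format!("node[{k}].learned[{i}].value_len"))? as usize;
            w.span(len, format!("node[{k}].learned[{i}].value"))?;
        }
    }
    w.hit
}
```

Feed it the first offset that cmp -l reports, minus one, since cmp counts bytes from 1 and field_at counts from 0.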

Tick-level diff

If cmp -l flags a divergence inside the accepts of node 1, add eprintln!/fmt.Fprintln(os.Stderr, ...)/std::cerr lines in each implementation at the boundaries of the suspect ticks:

// after handle() and after on_tick():
eprintln!("t={} id={} promised={:?} role={:?} my={:?} accepts={:?} learned={:?}",
          t, id, n.promised_ballot, n.role, n.my_ballot, n.accepts, n.learned);

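Run all three with stderr captured to rust.log, go.log, and cpp.log, making sure each implementation prints the fields in the same textual format, then diff -u rust.log go.log and diff -u rust.log cpp.log. The first differing tick is the bug.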

Most common culprits in practice

  1. Forgetting to sort the Promise payload by slot. Go's map iteration order is randomized; you must sort.Slice before appending to the wire.
  2. Reading next_slot before recovering from prepare_accepted. If recovery doesn't update next_slot = max + 1, the leader will double-allocate a slot that already has a recovered accept, silently overwriting it.
  3. Letting step_down clear promised_ballot. Promises are forever; only my_ballot is candidate-state.
  4. Counting yourself twice in accept_count. Both become_leader and try_decide insert self; the second one is a no-op only if accept_count is a set, not a multiset.
  5. Iterating peers as for p in nodes.iter() on a HashMap. Use BTreeMap in Rust, std::map in C++, and explicit for p := uint32(0); p < n; p++ in Go.
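Culprit 4 and the stale-reply guard can be handled in one place if the leader tracks acknowledgements as a set of acceptor ids per slot. A minimal sketch, assuming a hypothetical AckTracker type (the lab's own bookkeeping may be shaped differently) and ballots modelled as (round, proposer_id) tuples:

```rust
use std::collections::{BTreeMap, BTreeSet};

type Ballot = (u32, u32); // (round, proposer_id)

/// Leader-side bookkeeping for Accepted replies under one ballot.
struct AckTracker {
    my_ballot: Ballot,
    quorum: usize,
    acks: BTreeMap<u64, BTreeSet<u32>>, // slot -> ids of acceptors that said accept_ok
}

impl AckTracker {
    fn new(n: u32, my_ballot: Ballot) -> Self {
        AckTracker { my_ballot, quorum: (n / 2 + 1) as usize, acks: BTreeMap::new() }
    }

    /// Record an Accepted reply. Returns true when the reply was counted
    /// and the slot has reached quorum.
    fn on_accepted(&mut self, from: u32, ballot: Ballot, slot: u64, accept_ok: bool) -> bool {
        if ballot != self.my_ballot || !accept_ok {
            return false; // stale or negative reply: must not count toward quorum
        }
        let voters = self.acks.entry(slot).or_default();
        voters.insert(from); // inserting the same id twice is a no-op
        voters.len() >= self.quorum
    }
}
```

Because the voters are a BTreeSet, the leader inserting itself from two different code paths still counts as one vote, and any later iteration over the voters is in ascending id order.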

db-18 — Verification

Prerequisites

  • Rust ≥ 1.74 with cargo on PATH.
  • Go ≥ 1.22 (module declares go 1.22).
  • CMake ≥ 3.20 and a C++17 compiler (Apple clang ≥ 14, gcc ≥ 11).
  • An external sha256sum tool is not required: each binary computes its own SHA-256 in-process.

One command

cd db-18-paxos
bash scripts/verify.sh && bash scripts/cross_test.sh

Green is === verify OK === followed by === ALL OK ===. Anything else is a regression.

What verify.sh does

  1. Rust: cargo build --release then cargo test --release. Builds paxos18 lib + paxosctl binary; runs the 12 inline tests in src/rust/src/lib.rs. Expected output ends with test result: ok. 12 passed.
  2. Go: go build ./... then go test ./.... Builds cmd/paxosctl + package; runs the 11 tests in src/go/paxos_test.go. Expected output ends with PASS and ok github.com/10xdev/dse/db18.
  3. C++: cmake -DCMAKE_BUILD_TYPE=Release .., make -j, then ./test_db18. Builds db18_lib, paxosctl, test_db18; the test binary prints one line per assertion-group and ends with ALL 11 TESTS PASSED.

If any of these three blocks fails, the script exits non-zero and the rest does not run.

What cross_test.sh does

For each of the six canonical scenarios (A–F), it invokes the three release binaries with identical flags, captures stdout, and asserts rust == go == cpp byte-for-byte. The output prints the matching hash on success; on mismatch it prints all three hashes and exits.

The script does not trust the canonical hashes from this repo to be correct — it only enforces consistency among the three implementations. The "is the hash also the historical fingerprint" check happens by comparing the script's output against docs/observation.md § Expected canonical hashes.

What green guarantees

If both scripts pass:

  1. Safety in the modeled environment. For every seed × scenario in the suite, no acceptor stored a decided value that contradicts another node's decided value for the same slot. The unit tests include cases for dueling proposers, partitions, and Phase-1 recovery; the cross-test sweeps the same scenarios across three independent implementations.
  2. Determinism. Same inputs ⇒ same canonical dump bytes, across languages and across machines (modulo endianness — all targets are little-endian).
  3. Liveness in the modeled environment. Scenarios A, B, D, F all include proposals and run long enough to elect a leader and decide them. Scenarios C and E exist to confirm we don't decide when we shouldn't (C has no proposals; E isolates node 0 so it must not influence the chosen value while {1,2} still carry the load).

What green does not guarantee

  • Behavior outside the canonical scenarios. The state space of three-process Multi-Paxos is exponential; six fingerprints are an acceptance test, not a model checker. A real Paxos audit needs TLA+ (see references.md § Background reading).
  • Performance. No latency or throughput is checked. Scenario A takes ~150 ticks of simulated time to decide; that is a function of the configured ELECTION_TIMEOUT_MIN, not a wall-clock SLA.
  • Snapshotting, membership change, log compaction. None of these exist in this lab; the dump grows unboundedly in accepts and learned. db-23 covers the rest.
  • Production safety primitives — leader leases, fsync barriers, on-disk checksums, recovery from torn writes, byzantine actors. All deliberately out of scope.

Invariant assertions in code

Each implementation re-checks the lab's invariants where the cost is near-zero. The most load-bearing assertions are listed below; their firing means the test that triggered them is reporting a symptom of a Phase-1 / Phase-2 bug, not a flaky test.

| Where | Assertion | What it catches |
|---|---|---|
| Handle::Promise (all 3 langs) | leader ignores Promise if b != my_ballot | stale Promise replies from a previous Phase 1 (would inflate the quorum count and decide too early) |
| Handle::Accepted (all 3 langs) | leader ignores Accepted if b != my_ballot | same, for Phase 2 |
| try_decide (all 3 langs) | only the current Leader can mark a slot learned | a stepped-down node attempting to declare a decision (would split-brain learned) |
| Promise payload serialization (all 3 langs) | accepts iterated in ascending slot order | undetected map-iteration drift between languages |
| canonical_dump writer (all 3 langs) | nodes in ascending id; per-node accepts and learned in ascending slot | drift between three independent dump writers |
| Rust unit single_node_in_three_node_partition_does_not_decide | isolated minority must have empty learned | a quorum-counting bug that lets a single node decide |
| Go unit TestMajorityRequiredToDecide | 1-of-3 cannot decide | same, Go side |
| C++ unit cannot_decide_in_minority | 1-of-3 cannot decide | same, C++ side |

db-18 — Broader Ideas

The lab implements textbook Multi-Paxos with a deterministic simulator and three-language cross-validation. It deliberately stops where production engineering begins. This document collects the threads worth pulling on next.

Variants and refinements

Fast Paxos (Lamport 2006)

Skips Phase 2's "leader replays" step on the happy path by letting any proposer broadcast Accept directly. The cost: the fast-path quorum must be ⌈3n/4⌉ instead of ⌊n/2⌋ + 1, so 4-of-5 instead of 3-of-5. When two proposers collide on the fast path the system falls back to classic Paxos. Worth implementing as db-18b once the classic version is fluent — it reuses the entire wire format and only changes the proposer-side state machine.
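The quorum arithmetic above is easy to check mechanically. The sketch below computes both quorum sizes and tests the counting form of the Fast Paxos intersection requirement (any two fast quorums and any classic quorum share an acceptor, which holds exactly when 2·fast + classic > 2n). The function names are illustrative only.

```rust
/// Classic majority quorum: floor(n/2) + 1.
fn classic_quorum(n: u32) -> u32 {
    n / 2 + 1
}

/// Fast-path quorum from the text: ceil(3n/4), in integer arithmetic.
fn fast_quorum(n: u32) -> u32 {
    (3 * n + 3) / 4
}

/// Counting form of the requirement that any two fast quorums and any
/// classic quorum share at least one acceptor.
fn fast_path_is_safe(n: u32, fast: u32, classic: u32) -> bool {
    2 * fast + classic > 2 * n
}
```

For n = 5 this gives a classic quorum of 3 and a fast quorum of 4; trying to reuse the majority size 3 as the fast quorum fails the check (2·3 + 3 = 9 is not greater than 10).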

EPaxos (Moraru, Andersen, Kaminsky, SOSP 2013)

Drops the leader entirely. Each command picks its own dependency graph among recently-issued commands and decides in one RTT if no conflict, two RTTs otherwise. The "deterministic simulator + three implementations" discipline you build here is what makes EPaxos's notoriously subtle conflict-detection logic testable at all. Its ideas have influenced later leaderless protocols aimed at geo-distributed deployments.

Generalized Paxos (Lamport, MSR-TR-2005-33)

Allows commutative commands to be partially ordered concurrently, not totally ordered serially. The state-machine layer must explicitly declare command commutativity. Precursor to EPaxos. Operationally similar to CRDTs at the storage layer (db-21) but with hard consensus underneath.

Vertical Paxos (Lamport, Malkhi, Zhou, PODC 2009)

Separates the "agree on the value at slot S" problem from the "agree on the membership of the acceptor set at slot S" problem, by delegating reconfiguration to an auxiliary master. Cleaner than joint-consensus (Raft's approach) and Lamport's preferred way to do membership changes. db-23 will revisit.

Flexible Paxos (Howard 2016, dissertation 2019)

Observation: the two quorums in Paxos don't have to be majorities. They only have to intersect. So Phase-1 quorum + Phase-2 quorum just have to sum to more than n. Production payoff: you can run with a smaller Phase-2 quorum (lower latency on the common path) in exchange for a larger Phase-1 quorum (higher cost during leadership churn). A great teaching variant to layer on top of this lab once the canonical hashes are stable.
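The intersection claim can be confirmed by brute force for small clusters. The sketch below compares the counting rule (q1 + q2 > n) against an exhaustive check over all acceptor subsets, encoded as bitmasks. Both function names are hypothetical, and the exhaustive version is only practical for small n.

```rust
/// Counting form of the Flexible Paxos requirement for size-based quorums.
fn quorums_intersect(n: u32, q1: u32, q2: u32) -> bool {
    q1 + q2 > n
}

/// Exhaustive check: every subset of size q1 shares a member with every
/// subset of size q2. Subsets of {0..n} are encoded as bitmasks, so keep n small.
fn quorums_intersect_exhaustive(n: u32, q1: u32, q2: u32) -> bool {
    let all = 0u32..(1u32 << n);
    let of_size = |k: u32| all.clone().filter(move |m| m.count_ones() == k).collect::<Vec<u32>>();
    let (phase1, phase2) = (of_size(q1), of_size(q2));
    phase1.iter().all(|a| phase2.iter().all(|b| a & b != 0))
}
```

For n = 5, a Phase-1 quorum of 4 paired with a Phase-2 quorum of 2 passes both checks, while 3 and 2 fails both. The same check with n = 4 and both quorums at size 2 shows the disjoint pair that breaks a majority-free 4-node cluster.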

Production systems to study

Google Chubby

Five-replica Paxos lock service powering Google's lookup infrastructure (DNS, leader election for other services). Chandra et al.'s Paxos Made Live (PODC 2007) is the canonical writeup of what it took to turn the algorithm into a system: leader leases, snapshots every few minutes, master-side group membership, three generations of disk-corruption handling. Read alongside this lab once green.

Google Spanner

Multi-Paxos per shard. Spanner's contribution above Multi-Paxos is TrueTime — a clock API with bounded uncertainty that lets the system serve external-consistency-preserving reads without a Paxos round. The Paxos layer itself is exactly the algorithm you've implemented, plus production hardening.

Apache Cassandra LWT

Lightweight Transactions use per-request Paxos (not Multi-Paxos) to give linearizable CAS-style updates on top of Cassandra's eventually-consistent replication. Cassandra picks a fresh ballot per request, so it pays the Phase-1 cost every time and never amortizes it, which makes it a clean illustration of what Multi-Paxos's stable leader buys you.

Microsoft Azure Service Fabric

Reported to use a Paxos-family protocol under the hood for ring-leader election and replicated state services. Less publicly documented than the systems above; the main public architectural description is the EuroSys 2018 Service Fabric paper (Kakivaya et al.), worth chasing for an industrial counterpoint.

Apache ZooKeeper (ZAB)

Not strict Paxos but in the same family. ZAB layers epoch+counter on top of a primary-order protocol; the zxid pair is the direct analogue of this lab's Ballot. db-19 builds it.

Performance experiments worth running

The deterministic simulator is too clean for real performance work, but the simulator's ticks are a fine unit of cost for comparative experiments:

  • Phase-1 amortization sweep. For nodes ∈ {3,5,7} (the CLI caps --nodes at 8, so adding 9 means lifting that cap first), run proposals = 50 and count the number of ticks to decide the last slot. The expected curve is linear in nodes for the first decision (Phase 1 costs a broadcast round-trip per acceptor) and constant per slot thereafter (Phase 2 RTT).
  • Election-timer jitter sensitivity. Vary ELECTION_TIMEOUT_SPAN and measure how often dueling proposers ping-pong before someone wins. The textbook answer is "wider jitter = fewer collisions = fewer ballot bumps", and the simulator lets you confirm it without networking.
  • Quorum recomputation latency. For Flexible Paxos configurations, plot Phase-2 latency against Phase-1 quorum size. Howard 2016 has the analytical curve; you can ground-truth it.
  • Comparison to Raft (db-17). Same flags, same scenarios, same measurement. The lab structure is identical on purpose.

What "production-quality" would require beyond this lab

  • Disk durability. A real acceptor fsyncs promised_ballot, accepts, and (depending on design) learned before replying to a Promise / Accepted. Without that, a crash-restart cycle can silently retract a promise and break safety.
  • Snapshotting. accepts and learned grow forever in this lab. A real system periodically snapshots the state machine and garbage-collects accepts below the snapshot index. The snapshot itself must be agreed on by Paxos (or by a separate snapshot coordinator), which is a whole other lab.
  • Membership reconfiguration. Adding/removing acceptors safely is non-trivial: you must either run two configurations in parallel during the transition (Raft's joint consensus) or delegate the membership decision to a higher layer (Vertical Paxos). db-23 picks this up.
  • Leader leases. Production Paxos systems give the current leader a time-bounded lease to serve reads without consulting acceptors. This requires a synchronized clock model (Spanner's TrueTime, or weaker lease-renewal protocols) — orthogonal to consensus per se but tightly coupled in real deployments.
  • Witness / arbiter nodes. Some deployments allow a third node to hold no data but break tie-vote symmetry. Implementing this while keeping safety proofs sound requires care.
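  • Recovery from disk corruption. Real-world failure modes include silent bit-rot of promised_ballot. The defensive posture is to checksum every persisted record and treat a checksum failure as lost acceptor state. Restarting as if the node had never voted is not safe, because it silently retracts promises (the same hazard as skipping fsync). The conservative option, described in Paxos Made Live, is to rejoin as a non-voting member and catch up before answering Prepare or Accept again, at the cost of liveness during recovery.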
  • Observability. Live systems need per-slot decision latency histograms, per-acceptor promise rejection counters, leader flap detection. The canonical dump is the right shape of observability but the real one runs continuously rather than on-demand.
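To make the durability and checksum bullets concrete, here is a minimal sketch of a checksummed promised_ballot record that is flushed with fsync before the acceptor would reply. Everything here is an assumption for illustration: the 12-byte record layout, the function names, and the use of FNV-1a as a stand-in for a production checksum such as CRC32C. What to do when a checksum fails is a policy decision outside this sketch.

```rust
use std::fs::File;
use std::io::{self, Write};
use std::path::Path;

/// 32-bit FNV-1a, used here only as a simple stand-in checksum.
fn fnv1a32(data: &[u8]) -> u32 {
    data.iter().fold(0x811C_9DC5u32, |h, b| (h ^ *b as u32).wrapping_mul(0x0100_0193))
}

/// Record layout (hypothetical): u32_le(round) | u32_le(proposer_id) | u32_le(checksum).
fn encode_promise(round: u32, proposer_id: u32) -> [u8; 12] {
    let mut rec = [0u8; 12];
    rec[0..4].copy_from_slice(&round.to_le_bytes());
    rec[4..8].copy_from_slice(&proposer_id.to_le_bytes());
    let sum = fnv1a32(&rec[0..8]);
    rec[8..12].copy_from_slice(&sum.to_le_bytes());
    rec
}

/// Returns None when the stored checksum does not match the payload.
fn decode_promise(rec: &[u8; 12]) -> Option<(u32, u32)> {
    let stored = u32::from_le_bytes([rec[8], rec[9], rec[10], rec[11]]);
    if stored != fnv1a32(&rec[0..8]) {
        return None;
    }
    let round = u32::from_le_bytes([rec[0], rec[1], rec[2], rec[3]]);
    let proposer_id = u32::from_le_bytes([rec[4], rec[5], rec[6], rec[7]]);
    Some((round, proposer_id))
}

/// Write the record and fsync it; only after this returns Ok would an
/// acceptor send its Promise.
fn persist_promise(path: &Path, round: u32, proposer_id: u32) -> io::Result<()> {
    let mut f = File::create(path)?;
    f.write_all(&encode_promise(round, proposer_id))?;
    f.sync_all()
}
```

A production acceptor would more likely append to a log or write a temporary file and rename it rather than truncating in place, so a crash mid-write cannot destroy the previous record.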

db-18 step 01 — Single-decree Paxos

Goal

Build the two-phase Paxos protocol for one slot. A proposer must be able to drive a value to a decision in the presence of competing proposers, and an acceptor's recorded state must be exactly enough for the next proposer to recover any value that might have already been chosen. The byte layout of acceptor state must be identical across Rust, Go, and C++.

Tasks

  1. Ballot. Define Ballot { round: u32, proposer_id: u32 } with lexicographic ordering (round first, then proposer_id as tie-break). Provide a Ballot::ZERO constant equal to (0, 0). Every comparison in the rest of the protocol uses this order.

  2. PaxosNode skeleton. Each node carries:

    • id: u32, n: u32 (cluster size), quorum = n/2 + 1.
    • role: Role (Follower / Candidate / Leader).
    • promised_ballot: Ballot — highest ballot ever promised (one value, shared across all slots in this Multi-Paxos style).
    • my_ballot: Ballot — this proposer's current attempt.
    • accepts: BTreeMap<Slot, (Ballot, Vec<u8>)> — for each slot, the highest-ballot accept observed.
    • learned: BTreeMap<Slot, Vec<u8>> — decided values, in slot order.
  3. on_prepare(ballot). If ballot >= promised_ballot, set promised_ballot = ballot and reply Promise { accept_ok = true, accepted = [(slot, ab, value) for every entry in accepts] }. Otherwise reply Promise { accept_ok = false, accepted = [] }. The full walk over accepts is what makes Phase 1 the recovery step.

  4. on_promise. A proposer collects promises until it has a quorum. For each slot mentioned in any promise, it adopts the value of the highest-ballot accept (Paxos safety property P2c). For slots with no prior accept, the proposer is free to use its own pending value. The proposer then transitions to Leader and broadcasts Accept for every slot in its working set.

  5. on_accept(ballot, slot, value). If ballot >= promised_ballot, update promised_ballot = ballot, overwrite accepts[slot] = (ballot, value), reply Accepted { accept_ok = true }. Otherwise reply Accepted { accept_ok = false }. Note that an accepted value is never unaccepted — only superseded by a higher-ballot accept on the same slot.

  6. on_accepted. A proposer that collects a quorum of accept_ok = true for the same (slot, ballot) learns the value and broadcasts Decided { slot, value }. Learners (every node) record learned[slot] = value on Decided.
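The acceptor half of tasks 1, 3, and 5 fits in a few dozen lines. The sketch below is one possible shape, not the lab's reference solution: the Acceptor and Promise types are simplified (no RPC envelope, no node id), and on_accept returns a bare bool in place of an Accepted message.

```rust
use std::collections::BTreeMap;

/// derive(Ord) compares fields in declaration order: round first, then proposer_id.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Ballot {
    round: u32,
    proposer_id: u32,
}

impl Ballot {
    const ZERO: Ballot = Ballot { round: 0, proposer_id: 0 };
}

type Slot = u64;

#[derive(Debug, PartialEq)]
struct Promise {
    accept_ok: bool,
    accepted: Vec<(Slot, Ballot, Vec<u8>)>,
}

struct Acceptor {
    promised_ballot: Ballot,
    accepts: BTreeMap<Slot, (Ballot, Vec<u8>)>,
}

impl Acceptor {
    fn new() -> Self {
        Acceptor { promised_ballot: Ballot::ZERO, accepts: BTreeMap::new() }
    }

    /// Phase 1: promise if the ballot is at least as high as anything promised,
    /// and report every prior accept in ascending slot order.
    fn on_prepare(&mut self, ballot: Ballot) -> Promise {
        if ballot >= self.promised_ballot {
            self.promised_ballot = ballot;
            let accepted = self
                .accepts
                .iter()
                .map(|(slot, (b, v))| (*slot, *b, v.clone()))
                .collect();
            Promise { accept_ok: true, accepted }
        } else {
            Promise { accept_ok: false, accepted: Vec::new() }
        }
    }

    /// Phase 2: accept unless a higher ballot has been promised. Returns accept_ok.
    fn on_accept(&mut self, ballot: Ballot, slot: Slot, value: &[u8]) -> bool {
        if ballot >= self.promised_ballot {
            self.promised_ballot = ballot;
            self.accepts.insert(slot, (ballot, value.to_vec()));
            true
        } else {
            false
        }
    }
}
```

Deriving Ord on a struct whose fields are declared as (round, proposer_id) gives the lexicographic order for free, and iterating the BTreeMap yields the Promise payload already sorted by slot.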

Acceptance

Inline unit tests in each language. Names below are the Rust form; Go uses TestSha256KnownVectors style, C++ uses test_sha256_known_vectors:

  • sha256_known_vectors — empty, "abc", and the lazy-dog vector all hash to the well-known constants. Locks the SHA-256 implementation to RFC 6234.
  • dueling_proposers_higher_ballot_wins — acceptor promises (1,1), then (1,2) arrives and is promised; a stale Accept at (1,1) is nacked. Verifies promised_ballot monotonicity.
  • promise_carries_prior_accept_for_recovery — acceptor with a prior accept at ballot b1 on slot 0 receives a Prepare at higher ballot b2; the Promise must include the (0, b1, value) tuple so the new leader can re-propose the value. This is P2c.
  • majority_required_to_decide — proposer in a 5-node cluster with only 2 of 5 accepts must not call the slot decided; the third accept tips it over the threshold.
  • ballot_ordering_is_lexicographic: (1,9) < (2,0), (1,0) < (1,1), ZERO < (0,1). Locks the comparator.

All five green in Rust, Go, and C++.

Discussion prompts

  • Quorum intersection. Why must any two quorums share at least one acceptor? Walk through what breaks if a 4-node cluster used quorum size 2 instead of 3.
  • Why P2c. Suppose Phase 1 returned just accept_ok without the list of prior accepts. Construct a 3-node run where a value v is chosen, then a higher-ballot proposer chooses a different value w. Why does carrying prior accepts forward in the Promise prevent this?
  • Ballots vs terms. Raft's term is a single u64. Paxos's ballot is (round, proposer_id). What does the proposer_id tie-break buy you that a single counter would not, and why does Raft not need it?

db-18 step 02 — Multi-Paxos and the replicated log

Goal

Generalise single-decree Paxos into a log. A stable leader runs Phase 1 once, then drives a sequence of slots through Phase 2 only — that is the entire point of Multi-Paxos. Newly elected leaders must recover any partially-accepted slots before issuing new proposals, so the log stays contiguous and every committed prefix is identical on every replica.

Tasks

  1. Leader election trigger. A Follower or Candidate whose election_deadline elapses bumps my_ballot.round = max(my_ballot.round, promised_ballot.round) + 1, sets my_ballot.proposer_id = self.id, transitions to Candidate, and broadcasts Prepare { ballot: my_ballot }. Election deadline is reset with the same splitmix64 jitter formula as Raft: t + 150 + splitmix64(seed ^ id ^ t) % 150.

  2. become_leader. On collecting quorum promises for my_ballot, transition to Leader, then:

    • Compute next_slot = max(slot in any promise.accepted) + 1, defaulting to max(learned.keys()) + 1 if no accepts were reported.
    • For every slot in [0, next_slot) that appears in any promise: adopt the value of the highest-ballot accept and broadcast Accept { my_ballot, slot, value } (this is the recovery sweep — it re-proposes potentially-chosen values under the new ballot).
    • Call drain_pending to attach pending client values to the next free slots, broadcasting Accept for each.
  3. Heartbeat. A Leader whose heartbeat_due elapses broadcasts Heartbeat { ballot: my_ballot }. Followers reset their election timers on any inbound RPC from the current leader. This is what makes Multi-Paxos amortise Phase 1: as long as heartbeats arrive, no one starts a new ballot.

  4. Decided broadcast. When a leader's try_decide(slot) sees a quorum of accept_ok=true for the slot's ballot, it marks learned[slot] = value and broadcasts Decided { slot, value } to every node. Learners record the value in learned; the leader does not need to re-decide on receipt.

  5. Lowest-id leader rule. When tests inspect "the" leader of a cluster, they pick the Leader with the lowest id. This is a deterministic tie-break for the (rare) case where two nodes briefly both believe themselves leader during a flap; the safety invariants do not depend on at-most-one Leader, only on at-most-one chosen value per slot.
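The recovery sweep in become_leader reduces to two small pure functions: merge the accepted lists from a quorum of promises, keeping the highest-ballot value per slot, then derive next_slot. The sketch below uses hypothetical names (recover, next_slot) and models ballots as (round, proposer_id) tuples; the fallback to 0 when nothing has been accepted or learned is an assumption, since the task text does not state it.

```rust
use std::collections::BTreeMap;

type Slot = u64;
type Ballot = (u32, u32); // (round, proposer_id); tuples compare lexicographically
type Accepted = Vec<(Slot, Ballot, Vec<u8>)>;

/// Merge the `accepted` lists from a quorum of promises, keeping the
/// highest-ballot value for each slot.
fn recover(promises: &[Accepted]) -> BTreeMap<Slot, (Ballot, Vec<u8>)> {
    let mut best: BTreeMap<Slot, (Ballot, Vec<u8>)> = BTreeMap::new();
    for promise in promises {
        for (slot, ballot, value) in promise {
            let replace = match best.get(slot) {
                Some((seen, _)) => ballot > seen,
                None => true,
            };
            if replace {
                best.insert(*slot, (*ballot, value.clone()));
            }
        }
    }
    best
}

/// next_slot = highest recovered slot + 1, else highest learned slot + 1,
/// else 0 (the final fallback is an assumption of this sketch).
fn next_slot(
    recovered: &BTreeMap<Slot, (Ballot, Vec<u8>)>,
    learned: &BTreeMap<Slot, Vec<u8>>,
) -> Slot {
    recovered
        .keys()
        .next_back()
        .or_else(|| learned.keys().next_back())
        .map_or(0, |s| s + 1)
}
```

The new leader would then broadcast Accept under my_ballot for each entry of the recovered map, in ascending slot order, before draining pending client values into slots starting at next_slot.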

Acceptance

Inline unit tests in each language:

  • single_node_decides_every_proposal — a 1-node cluster (quorum 1) with proposals = 3 ends with learned = [(0, "val-0"), (1, "val-1"), (2, "val-2")]. Degenerate case but verifies the leader path.
  • three_node_elects_single_leader: Cluster::new(42, 3) after 500 ticks with zero proposals has exactly one node in role Leader.
  • three_node_replicates_proposals: Cluster::new(42, 3) after 1000 ticks with proposals = 5 has every node's learned of length 5 and byte-identical to node 0's.
  • multi_slot_log_is_contiguous — 10 proposals on a 3-node cluster yield slot keys 0..10 on every node, no gaps.
  • partition_heals_progress_resumes — drop all traffic between node 0 and the other two; the surviving pair {1, 2} still elects a leader and decides 4 proposals. Demonstrates that Multi-Paxos liveness depends on some quorum being connected, not on the original leader being reachable.

All five green in Rust, Go, and C++.

Discussion prompts

  • Amortisation. Why is the Phase 1 cost paid only at leader change in Multi-Paxos but on every decision in single-decree Paxos? What is the steady-state message count per decision on a 5-node cluster?
  • Leader leases. Real systems (Spanner, Chubby) layer a lease on top of Multi-Paxos so the leader can serve linearizable reads without quorum. What changes in the safety argument if you serve reads off the leader without a lease?
  • Recovery cost. A new leader must walk every acceptor's full accepts map for the recovery sweep. What is the message size in bytes for a log with 1M slots and 256-byte values? What optimisations (truncation, snapshots, min_slot exchange) would you add for production?

db-18 step 03 — Cross-language determinism

Goal

The Rust, Go, and C++ implementations must, given the same (seed, nodes, rounds, proposals, partition) CLI inputs, produce the byte-identical canonical dump and therefore the same SHA-256. This is the third leg of the lab: protocol correctness plus simulator determinism plus serialisation discipline.

Tasks

  1. Discrete-event simulator. A Cluster owns a min-heap of pending RPCs keyed (delivery_time, src, seq). seq is a monotonically increasing per-cluster counter assigned at send time, breaking ties when two RPCs from the same sender become deliverable on the same tick. Every send pushes onto the heap; every tick pops everything due, dispatches it via node.handle(rpc, src, t, &mut out), and pushes any reply RPCs back onto the heap with delivery = t + 1 + splitmix64(seed ^ src ^ dst ^ seq) % 3.

  2. Iteration discipline. All iteration over collections of nodes, slots, or peers must be in sorted order. Rust uses BTreeMap / BTreeSet exclusively. Go uses sort.Slice / sort.Ints before every loop over a map's keys. C++ uses std::map / std::set. A single iteration over a hash map anywhere in the protocol path is enough to make the languages diverge, typically within ~2000 ticks.

  3. Partition modelling. The Cluster carries a Drop: Set<(u32, u32)> of dropped unidirectional edges. The CLI flag --partition s,d,s,d,... parses pairs and inserts them. Asymmetric partitions are intentional: --partition 0,1 only drops 0→1 traffic, not 1→0. Scenario F exercises this.

  4. Canonical dump. canonical_dump(&cluster) writes:

    "DSEPAX01"                     (8 bytes magic)
    u32_le(node_count)
    for each node in ascending id:
        u32_le(id)
        u32_le(promised_ballot.round)
        u32_le(promised_ballot.proposer_id)
        u8(role)                   (Follower=0, Candidate=1, Leader=2)
        u32_le(my_ballot.round)
        u32_le(my_ballot.proposer_id)
        u32_le(accepts_len)
        for each (slot, (ballot, value)) in accepts, ascending slot:
            u64_le(slot)
            u32_le(ballot.round)
            u32_le(ballot.proposer_id)
            u32_le(value_len)
            value bytes
        u32_le(learned_len)
        for each (slot, value) in learned, ascending slot:
            u64_le(slot)
            u32_le(value_len)
            value bytes
    

    Hash the bytes with SHA-256, print lowercase hex, no trailing newline.

  5. CLI: paxosctl. Each language ships a binary that accepts --seed <u64> --nodes <u32> --rounds <u32> --proposals <u32> [--partition s,d,...], runs the cluster for rounds ticks with proposals scheduled at tick = (i+1) * rounds / (proposals+1), value = b"val-" + itoa(i), dumps canonical bytes, prints the hex SHA-256.

  6. scripts/cross_test.sh. Builds all three binaries, runs the 6 scenarios A–F against each, compares the three hashes to the canonical table, and exits non-zero on mismatch. The script ends with === ALL OK === on success.
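
The heap described in item 1 can be sketched in Rust as follows. This is a minimal illustration, not the lab's actual code: the names InFlight, EventQueue, send, and pop_due are hypothetical, and the payload is a plain byte vector standing in for a typed RPC. It relies on the fact that Rust tuples compare lexicographically, so a (delivery_time, src, seq) key gives the required tie-break without any custom comparator.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Heap key: earliest delivery first, then lowest sender id, then send order.
/// Tuples compare lexicographically, which is the tie-break the lab asks for.
pub type Key = (u64, u32, u64); // (delivery_time, src, seq)

/// Hypothetical in-flight message; `payload` stands in for a typed RPC.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct InFlight {
    pub key: Key, // declared first so derived Ord compares it first
    pub dst: u32,
    pub payload: Vec<u8>,
}

#[derive(Default)]
pub struct EventQueue {
    heap: BinaryHeap<Reverse<InFlight>>, // Reverse turns the max-heap into a min-heap
    next_seq: u64,
}

impl EventQueue {
    /// Schedule a message; `seq` is assigned here, at send time.
    pub fn send(&mut self, delivery_time: u64, src: u32, dst: u32, payload: Vec<u8>) {
        let key = (delivery_time, src, self.next_seq);
        self.next_seq += 1;
        self.heap.push(Reverse(InFlight { key, dst, payload }));
    }

    /// Pop everything due at or before tick `t`, in deterministic key order.
    pub fn pop_due(&mut self, t: u64) -> Vec<InFlight> {
        let mut due = Vec::new();
        while let Some(Reverse(head)) = self.heap.peek() {
            if head.key.0 > t {
                break;
            }
            // peek() returned Some, so pop() cannot fail here.
            due.push(self.heap.pop().unwrap().0);
        }
        due
    }
}
```

Because seq is unique per cluster, no two keys are ever equal, so the pop order does not depend on how the standard library's heap happens to be laid out internally.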

Acceptance

Inline unit tests in each language:

  • dump_deterministic_across_runs — two independent Cluster::new(42, 3) instances each run 1000 ticks with 5 proposals produce byte-identical dumps. Confirms intra-language determinism.
  • Scenario A: --seed 42 --nodes 3 --rounds 1000 --proposals 5 → 0a35fdad1dd97c76a40a61b020c6181a56c4a40d4f723cb68fe70c2112aa9b63
  • Scenario B: --seed 7 --nodes 5 --rounds 2000 --proposals 20 → 3cc6cae6cb7f9d2b7cb88088a0f22581ac4c41bd86bab1b3676dd0ba33fd7ead
  • Scenario C: --seed 99 --nodes 3 --rounds 500 --proposals 0 → f28d025af748a790beded6167115c7094a7f939b45d439728e4d6b7e144c3be0
  • Scenario D: --seed 1 --nodes 1 --rounds 200 --proposals 5 → e5e0248c7c4fa20991b90afdac828eab91a7414497461dadc2e1553040693139
  • Scenario E: --seed 42 --nodes 3 --rounds 1000 --proposals 3 --partition 0,1,0,2,1,0,2,0 → 674e62d809248ac99401054c195d29b0e2eed6ccc78ec45e96da8aaf69c36096
  • Scenario F: --seed 3 --nodes 5 --rounds 1500 --proposals 10 --partition 0,1 → 7d80176abad54e533b2f4174e84f58432a000255fbb2ecbbb1dd915cb6bb6ab5

All six match across Rust, Go, and C++; bash scripts/cross_test.sh exits 0 with === ALL OK ===.

Discussion prompts

  • Sort discipline. Find the language-default hash map in your language. What is its iteration order? What is the cost of replacing it with the language's ordered map for the canonical dump path only versus everywhere?
  • SplitMix64. Why is splitmix64 a good fit for a deterministic simulator clock when something like rand::thread_rng() is not? Walk through the three constants — what are they and why?
  • Three languages. What classes of bug does the cross-language test catch that a single-language test cannot? (Hint: think signed-vs-unsigned overflow, default hash randomisation, iteration order, integer-promotion rules in comparisons.)

db-19 — ZAB (ZooKeeper Atomic Broadcast)

This lab implements ZAB — the atomic broadcast protocol that drives Apache ZooKeeper — in Rust, Go, and C++, all three producing a byte-identical sha256 of a canonical cluster dump for any (seed, nodes, rounds, proposals, partition) configuration. It inherits the deterministic-simulator discipline of db-16 and db-17: same splitmix64 seeding, same (delivery_time, sender, seq) heap tie-break, same "sorted iteration on the wire" rule.

Where db-17 Raft taught you that one consensus algorithm can be pinned down to a single byte sequence across three languages, db-19 ZAB does the same exercise for a different algorithm with a meaningfully different recovery story: an explicit Discovery / Synchronization phase between leader election and steady-state broadcast, and a transaction identifier (zxid) that pairs an epoch with a counter rather than Raft's single monotonic term + index.


What is it?

ZAB (Reed & Junqueira, LADIS 2008; Junqueira, Reed & Serafini, DSN 2011) is the primary-backup atomic broadcast protocol that ZooKeeper uses to keep its replicated state machine consistent. It is not a generic consensus library; it was designed specifically for ZooKeeper's workload: a small, well-known cluster (3, 5, 7 nodes), a small in-memory state machine, and a strong primary-order guarantee that arbitrary client requests served by the same primary are delivered in the order the primary chose.

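ZAB decomposes into four phases. The list below numbers them 1 to 4; the pseudocode sections further down use the papers' Phase 0 to 3 numbering. Phase 0 is the original FastLeaderElection; later papers fold it into Phase 1.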

  1. Leader election (FastLeaderElection). Every node starts in Looking. Each node broadcasts its current vote — initially for itself — carrying (last_zxid, server_id). On receiving a peer vote, a Looking node updates its own vote to that peer if (peer.last_zxid, peer.id) > (own.last_zxid, own.id) lexicographically, then re-broadcasts. When a quorum of voters agree on the same target, that node is elected: it transitions to Leading, everyone else who voted for it transitions to Following.

  2. Discovery. The new prospective leader picks a fresh new_epoch = max(accepted_epoch, current_epoch) + 1, sets its own accepted_epoch = new_epoch, and broadcasts NewEpoch(new_epoch). Each follower that accepts updates its accepted_epoch and replies with AckEpoch(current_epoch, last_zxid). Once a quorum of AckEpochs arrives, the leader knows the highest (current_epoch, last_zxid) in the surviving quorum — that node's history is the one that must survive.

  3. Synchronization. The leader bumps its current_epoch = new_epoch, resets the per-epoch counter, and broadcasts NewLeader(new_epoch, history) — the whole history that this epoch will start from. Followers replace their local history with the leader's, set current_epoch = new_epoch, and reply AckLeader(new_epoch). On a quorum of AckLeaders the leader declares itself synced and broadcasts Commit(last_zxid_of_history) so followers can advance last_committed past the synced tail.

  4. Broadcast (steady state). Now indistinguishable from Raft's replication phase, modulo names. For each client proposal, the leader assigns zxid = (current_epoch, ++counter), appends to its history, broadcasts Propose(txn). Followers append in zxid order and reply Ack(zxid). On Ack quorum the leader broadcasts Commit(zxid). Heartbeats are implemented as periodic re-sends of the last Commit (or NewLeader during pre-sync) — receiving one from the current leader refreshes the follower's election timer.

The simulator drives sim time forward in integer ticks; messages are scheduled into a heap with deterministic (delivery_time, sender, seq) order; an optional partition set drops messages in named directions.
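
The two comparisons the phases above rely on, zxid ordering and the election tie-break, can be sketched in Rust as follows. This is an illustrative sketch: the function names peer_vote_wins and quorum are hypothetical, and quorum is assumed to be a simple majority (n / 2 + 1), consistent with a one-node cluster being its own majority.

```rust
/// Transaction id. derive(Ord) compares fields in declaration order,
/// so `epoch` dominates and `counter` breaks ties within an epoch.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct ZxId {
    pub epoch: u32,
    pub counter: u32,
}

/// True when a peer's (last_zxid, id) beats our current vote target.
/// Tuple comparison is lexicographic: zxid first, then node id.
pub fn peer_vote_wins(peer_zxid: ZxId, peer_id: u32, own_zxid: ZxId, own_id: u32) -> bool {
    (peer_zxid, peer_id) > (own_zxid, own_id)
}

/// Majority quorum for an n-node cluster (assumed: n / 2 + 1).
pub fn quorum(n: u32) -> u32 {
    n / 2 + 1
}
```

With equal histories the higher node id wins, which is why a cold-start election converges on the highest-id node.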


Why does it matter?

  • ZAB is the algorithm under ZooKeeper, and ZooKeeper is the coordination kernel under Kafka (pre-KRaft), HBase, Hadoop YARN, Mesos, Druid, and a long list of other production systems. Knowing exactly how the NewLeader / Sync handshake works is the difference between operating ZooKeeper and understanding it.

  • ZAB and Raft cover the same problem with meaningfully different shapes. ZAB has an explicit recovery handshake that Raft folds into the AppendEntries consistency check; ZAB's zxid = (epoch, counter) is essentially Raft's (term, index), but the role each plays is subtly different. Implementing both back-to-back makes the contrast concrete instead of conceptual.

  • Three byte-identical implementations force the spec to be unambiguous. Anywhere ZAB "depends on the implementation" — election tie-break, vote rebroadcast on update, AckEpoch idempotency, heap scheduling — has to be pinned down. The cross-language sha256 makes drift loud.

  • Reproducible partitions. With a deterministic --partition s,d,... flag and a seeded simulator, you can replay the exact sequence of message drops, leader churn, and committed transactions that triggered a bug, on any machine, in any of the three languages.

  • Foundation for the rest of the track. db-20 distributed-kv will plug a consensus engine into a real key-value store; db-23 capstone composes the simulator harness across multiple replicated shards.


How does it work?

State (per node)

persistent  : current_epoch   : u32     # epoch of the leader we accepted into sync
              accepted_epoch  : u32     # epoch we've ack'd via NewEpoch (>= current_epoch)
              history         : Vec<Txn>
              last_committed  : ZxId

volatile    : role            : Looking | Following | Leading
              leader_id       : Option<u32>

election    : vote_target_id  : u32              # who we currently vote for
              vote_target_zxid: ZxId             # the (last_zxid) we voted on
              vote_view       : Map<voter_id, leader_id>   # tally

leader-only : pending_new_epoch : u32
              epoch_acks        : Set<follower_id>   # AckEpoch quorum tracker
              leader_acks       : Set<follower_id>   # AckLeader quorum tracker
              synced            : bool
              next_counter      : u32                # zxid counter under current_epoch
              ack_set           : Map<ZxId, Set<follower_id>>

timers      : election_deadline   : u64                # sim-time tick
              last_heartbeat_sent : u64

Election timer

reset_election_deadline(t):
    election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150

A 150-tick base plus 0 to 149 ticks of seeded jitter makes repeated split votes unlikely. Heartbeats fire every 50 ticks once a leader is synced.

FastLeaderElection (Phase 0)

on entering Looking:
    vote_target_id   = self.id
    vote_target_zxid = self.last_zxid()
    vote_view.clear(); vote_view[self.id] = self.id
    broadcast LookForLeader { self.id, self.last_zxid, current_epoch }
    broadcast Vote          { self.id, self.last_zxid, current_epoch, leader=self.id }
    check_election()

on Vote(voter, peer_zxid, _, leader_chosen) while Looking:
    if (peer_zxid, voter) > (vote_target_zxid, vote_target_id):
        vote_target_id   = voter
        vote_target_zxid = peer_zxid
        vote_view.clear(); vote_view[self.id] = voter
        broadcast Vote { self.id, self.last_zxid, current_epoch, leader=voter }
    vote_view[voter] = leader_chosen
    check_election()

check_election():
    target = vote_target_id
    if count(v in vote_view.values() : v == target) >= quorum:
        if target == self.id: become_leading()
        else:                 become_following(target)

LookForLeader is structurally a Vote for the sender: it lets a late-arriving node bootstrap a tally without waiting for the next broadcast cycle. Non-Looking peers reply to a Vote with their own current vote (which points at the live leader), so isolated nodes converge fast on heal.
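
The tally in check_election can be sketched in Rust like this. It is a simplified illustration: the Outcome enum and the function signature are hypothetical, and a real node would mutate its role instead of returning a value. The BTreeMap keeps the vote view keyed and iterated by ascending voter id.

```rust
use std::collections::BTreeMap;

/// Hypothetical result type; the lab's node would change `role` instead.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Undecided,
    Lead,
    Follow(u32),
}

/// `vote_view` maps voter id -> the leader that voter currently backs.
/// `target` is this node's own current vote (vote_target_id).
pub fn check_election(
    vote_view: &BTreeMap<u32, u32>,
    self_id: u32,
    target: u32,
    nodes: u32,
) -> Outcome {
    let quorum = (nodes / 2 + 1) as usize; // assumed simple majority
    let backers = vote_view.values().filter(|&&v| v == target).count();
    if backers < quorum {
        Outcome::Undecided
    } else if target == self_id {
        Outcome::Lead
    } else {
        Outcome::Follow(target)
    }
}
```

Counting is order-independent, so a hash map would give the same tally here; the ordered map matters for any loop whose side effects (such as re-broadcasts) depend on visit order.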

Discovery & Synchronization (Phases 1–2)

become_leading():
    role = Leading
    pending_new_epoch = max(accepted_epoch, current_epoch) + 1
    accepted_epoch    = pending_new_epoch
    epoch_acks = {self.id}
    broadcast NewEpoch(pending_new_epoch)
    try_finish_discovery()      # handles the n=1 case immediately

on NewEpoch(e) from L:
    if e > accepted_epoch:
        accepted_epoch = e
        if role != Following: become_following(L)
        reply AckEpoch(current_epoch, last_zxid)
    elif e == accepted_epoch:
        reply AckEpoch(current_epoch, last_zxid)   # idempotent
    reset_election_deadline()

on AckEpoch from F (only if Leading):
    epoch_acks += F
    try_finish_discovery()

try_finish_discovery():
    if |epoch_acks| < quorum: return
    current_epoch = pending_new_epoch
    next_counter  = 0
    leader_acks   = {self.id}
    broadcast NewLeader(current_epoch, history.clone())
    try_finish_sync()

on NewLeader(e, hist) from L:
    if e >= accepted_epoch:
        accepted_epoch = e
        current_epoch  = e
        history        = hist          # follower truncates / extends to leader's history
        if role != Following: become_following(L)
        reset_election_deadline()
        reply AckLeader(e)

on AckLeader(e) from F (only if Leading and e == current_epoch):
    leader_acks += F
    try_finish_sync()

try_finish_sync():
    if synced or |leader_acks| < quorum: return
    synced = true
    if last_zxid() > last_committed:
        last_committed = last_zxid()
        broadcast Commit(last_committed)
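
The epoch bookkeeping in the handshake above can be isolated into a small Rust sketch. The Follower struct and the Option return value are hypothetical simplifications: the sketch leaves out role changes, last_zxid, and timer resets, and focuses only on the accept / idempotent re-ack / ignore-stale decision and on how a new leader picks its epoch.

```rust
/// Hypothetical cut-down follower state: only the two epoch fields.
#[derive(Debug, Default)]
pub struct Follower {
    pub current_epoch: u32,
    pub accepted_epoch: u32,
}

impl Follower {
    /// Returns Some(current_epoch) when an AckEpoch should be sent,
    /// None when the NewEpoch is stale and gets no reply.
    pub fn on_new_epoch(&mut self, e: u32) -> Option<u32> {
        if e > self.accepted_epoch {
            self.accepted_epoch = e; // promise: never accept a lower epoch again
            Some(self.current_epoch)
        } else if e == self.accepted_epoch {
            Some(self.current_epoch) // re-sent NewEpoch: ack again, change nothing
        } else {
            None
        }
    }
}

/// Epoch a prospective leader proposes in become_leading().
pub fn next_epoch(accepted_epoch: u32, current_epoch: u32) -> u32 {
    accepted_epoch.max(current_epoch) + 1
}
```

Note that the ack carries current_epoch, not the freshly accepted one: current_epoch only moves when NewLeader arrives.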

Broadcast (Phase 3)

propose(payload):
    require role == Leading and synced
    next_counter += 1
    zxid = (current_epoch, next_counter)
    history.push(Txn { zxid, payload })
    ack_set[zxid] = {self.id}
    broadcast Propose(Txn { zxid, payload })
    try_commit(zxid)                   # single-node case

on Propose(txn) from L (only if Following and L == leader_id):
    if txn.zxid > last_zxid():
        history.push(txn)
        reset_election_deadline()
        reply Ack(txn.zxid)

on Ack(zxid) from F (only if Leading):
    ack_set[zxid] += F
    try_commit(zxid)

try_commit(zxid):
    if zxid <= last_committed: return
    if |ack_set[zxid]| >= quorum:
        last_committed = zxid
        broadcast Commit(zxid)

on Commit(zxid) from L:
    if L is current leader:
        reset_election_deadline()
    if last_committed < zxid <= last_zxid():
        last_committed = zxid
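
A Rust sketch of the leader side of the broadcast phase, showing why propose() runs the quorum check inline. The Leader struct here is a hypothetical cut-down version that tracks only acks and the commit point; it returns a bool where the real node would broadcast Commit, and it represents a zxid as an (epoch, counter) tuple for brevity.

```rust
use std::collections::{BTreeMap, BTreeSet};

pub type ZxId = (u32, u32); // (epoch, counter); tuples order lexicographically

/// Hypothetical leader state reduced to commit tracking.
pub struct Leader {
    pub id: u32,
    pub nodes: u32,
    pub current_epoch: u32,
    pub next_counter: u32,
    pub last_committed: ZxId,
    pub ack_set: BTreeMap<ZxId, BTreeSet<u32>>,
}

impl Leader {
    fn quorum(&self) -> usize {
        (self.nodes / 2 + 1) as usize // assumed simple majority
    }

    /// Assign the next zxid and self-ack; returns (zxid, committed_now).
    pub fn propose(&mut self) -> (ZxId, bool) {
        self.next_counter += 1;
        let zxid = (self.current_epoch, self.next_counter);
        self.ack_set.entry(zxid).or_default().insert(self.id);
        // Inline check: with nodes == 1 no Ack will ever arrive to trigger it.
        (zxid, self.try_commit(zxid))
    }

    /// Record a follower's Ack; true if this ack advanced the commit point.
    pub fn on_ack(&mut self, zxid: ZxId, from: u32) -> bool {
        self.ack_set.entry(zxid).or_default().insert(from);
        self.try_commit(zxid)
    }

    /// True when last_committed advanced (a Commit would be broadcast).
    fn try_commit(&mut self, zxid: ZxId) -> bool {
        if zxid <= self.last_committed {
            return false;
        }
        let acks = self.ack_set.get(&zxid).map_or(0, |s| s.len());
        if acks < self.quorum() {
            return false;
        }
        self.last_committed = zxid;
        true
    }
}
```

In a three-node cluster the first follower Ack commits the transaction and the second is a no-op; in a one-node cluster propose() commits on its own.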

Simulator loop (per tick t in 0..rounds)

1. enqueue scheduled proposals : if t == schedule[i], push payload onto pending
2. inject pending into leader  : pick (Leading and synced, lowest id); call propose
3. deliver in-flight           : pop heap entries with delivery_time <= t
4. tick all nodes              : iterate in ascending id; on_tick may fire election or heartbeat

Proposal schedule: schedule[i] = (i+1) * rounds / (K+1) for i in 0..K (integer division). Each payload is the byte string "zab-<i>" (plain decimal, no padding). Deterministic, evenly spread, and independent of cluster behaviour.
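
The schedule and payload rules are small enough to state directly in Rust. The function names proposal_schedule and payload are hypothetical; the formula and the "zab-<i>" format are the ones given above.

```rust
/// Ticks at which the K proposals fire: (i+1) * rounds / (K+1), integer division.
pub fn proposal_schedule(rounds: u64, k: u64) -> Vec<u64> {
    (0..k).map(|i| (i + 1) * rounds / (k + 1)).collect()
}

/// Payload for proposal i: "zab-<i>", plain decimal, no padding.
pub fn payload(i: u64) -> Vec<u8> {
    format!("zab-{}", i).into_bytes()
}
```

For example, proposal_schedule(1000, 5) yields the ticks 166, 333, 500, 666, 833. Multiplying before dividing is part of the contract: (i+1) * (rounds / (K+1)) would truncate earlier and give different ticks.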

Wire format (Rpc)

Nine variants. The simulator never serializes RPCs — it passes them as typed values in memory. The only bytes that ever get hashed are the canonical dump.

LookForLeader { src_id, last_zxid, peer_epoch }
Vote          { voter_id, last_zxid, peer_epoch, leader_id }
NewEpoch      { new_epoch }
AckEpoch      { current_epoch, last_zxid }
NewLeader     { new_epoch, history: Vec<Txn> }
AckLeader     { new_epoch }
Propose       { txn: Txn }
Ack           { zxid }
Commit        { zxid }

Canonical dump format

file := magic[8 = "DSEZAB01"] u32_le(node_count) node*

node := u32_le id
        u8     role               # Looking=0, Following=1, Leading=2
        u32_le current_epoch
        u32_le accepted_epoch
        u32_le last_zxid.epoch
        u32_le last_zxid.counter
        u32_le last_committed.epoch
        u32_le last_committed.counter
        u32_le history_len
        txn * history_len

txn  := u32_le zxid.epoch
        u32_le zxid.counter
        u32_le payload_len
        u8 payload[payload_len]

Nodes appear in ascending id order. All multi-byte numbers are little-endian. The dump is hashed with SHA-256; the lowercase hex digest is what zabctl prints (no trailing newline).
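
A Rust sketch of an encoder for the layout above. The struct and field names are hypothetical, the caller is assumed to pass nodes already sorted by ascending id, and last_zxid is assumed to be the zxid of the final history entry, or (0, 0) for an empty history. Hashing is left out because SHA-256 is not in the Rust standard library.

```rust
pub struct Txn {
    pub zxid: (u32, u32), // (epoch, counter)
    pub payload: Vec<u8>,
}

pub struct NodeState {
    pub id: u32,
    pub role: u8, // Looking=0, Following=1, Leading=2
    pub current_epoch: u32,
    pub accepted_epoch: u32,
    pub last_committed: (u32, u32),
    pub history: Vec<Txn>,
}

fn put_u32(out: &mut Vec<u8>, v: u32) {
    out.extend_from_slice(&v.to_le_bytes()); // little-endian on every host
}

/// Serialise nodes (already sorted by ascending id) into the canonical layout.
pub fn canonical_dump(nodes: &[NodeState]) -> Vec<u8> {
    let mut out = b"DSEZAB01".to_vec();
    put_u32(&mut out, nodes.len() as u32);
    for n in nodes {
        // Assumption: last_zxid is the tail of history, (0, 0) when empty.
        let last = n.history.last().map_or((0, 0), |t| t.zxid);
        put_u32(&mut out, n.id);
        out.push(n.role); // the only single-byte field
        for v in [
            n.current_epoch,
            n.accepted_epoch,
            last.0,
            last.1,
            n.last_committed.0,
            n.last_committed.1,
            n.history.len() as u32,
        ] {
            put_u32(&mut out, v);
        }
        for t in &n.history {
            put_u32(&mut out, t.zxid.0);
            put_u32(&mut out, t.zxid.1);
            put_u32(&mut out, t.payload.len() as u32);
            out.extend_from_slice(&t.payload);
        }
    }
    out
}
```

Using to_le_bytes rather than a raw memory copy is what makes the bytes independent of host endianness and struct padding.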

Primary-order property

ZAB's defining guarantee, distinct from generic atomic broadcast, is primary order: if a primary (leader) broadcasts proposals p then q in that order, every follower that delivers both delivers p before q. This is enforced trivially by the leader's monotonically increasing next_counter and the follower's txn.zxid > last_zxid() gate on Propose. Primary order is a per-primary property; across leadership changes the guarantee is provided by the Discovery / Sync handshake that explicitly chooses the surviving primary's history.

Cross-language invariants

| Invariant | Why it matters |
| --- | --- |
| splitmix64 constants 0x9E3779B97F4A7C15, 0xBF58476D1CE4E7B5, 0x94D049BB133111EB | identical PRNG output |
| election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150 | identical election firing times |
| delivery_delay = 1 + splitmix64(seed ^ src ^ dst ^ t) % 3 | identical message scheduling |
| heap order (delivery_time, sender, seq); seq global monotonic | identical delivery sequence |
| peers iterated in ascending id (BTreeMap / std::map / explicit loop) | identical broadcast order |
| vote_view keyed by voter id, iterated in ascending id | identical election tally |
| election tie-break: lexicographic (last_zxid, voter_id) | identical leader choice |
| leader-pick for proposal injection: Leading && synced && min id | identical client routing |
| proposal schedule (i+1)*rounds/(K+1); payload "zab-<i>" unpadded decimal | identical pending queue contents |
| propose() calls try_commit() | identical last_committed for n=1 |
| Role enum order Looking=0, Following=1, Leading=2 | identical dump bytes |
| dump magic "DSEZAB01"; role is u8, all multi-byte integers u32 LE; nodes in ascending id | identical dump bytes |

If any one of these drifts, scripts/cross_test.sh will fail, and cmp on the two raw dumps will print the offset of the first differing byte (cmp -l lists every differing byte instead).
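
When debugging from inside a test rather than from the shell, the same first-difference check is a few lines of Rust. The function name first_divergence is hypothetical. Unlike cmp, which reports 1-based byte numbers, this sketch returns a 0-based offset so it can be compared directly against the dump layout offsets.

```rust
/// 0-based offset of the first differing byte. If one dump is a strict
/// prefix of the other, returns the shorter length. None when identical.
pub fn first_divergence(a: &[u8], b: &[u8]) -> Option<usize> {
    if let Some(i) = a.iter().zip(b).position(|(x, y)| x != y) {
        return Some(i);
    }
    if a.len() != b.len() {
        Some(a.len().min(b.len()))
    } else {
        None
    }
}
```

Mapping the returned offset back onto the canonical dump format usually identifies the drifting field, for example a role byte or an epoch, without reading the whole dump.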


Files

  • src/rust/: zab19 crate + zabctl binary.
  • src/go/: module github.com/10xdev/dse/db19 + cmd/zabctl.
  • src/cpp/: db19_lib static library + zabctl binary + test_db19.
  • scripts/verify.sh: runs the unit tests for all three.
  • scripts/cross_test.sh: proves the three binaries produce byte-identical canonical dumps for six seeded scenarios.

See docs/ for the long-form write-up and steps/ for the staged implementation path.

db-19 — References

Primary sources

  • Benjamin Reed and Flavio P. Junqueira, A simple totally ordered broadcast protocol, LADIS 2008. The original ZAB paper — short, workshop-length, and the only place that describes the algorithm in the exact "Phase 0 / 1 / 2 / 3" shape it took inside ZooKeeper. https://dl.acm.org/doi/10.1145/1529974.1529978
  • Flavio P. Junqueira, Benjamin C. Reed, and Marco Serafini, Zab: High-performance broadcast for primary-backup systems, DSN 2011. The peer-reviewed, formal treatment. Defines the primary order property, gives the proof obligations, and folds the original Phase 0 into Phase 1. This is the paper to cite when arguing the correctness of any particular handshake decision. https://marcoserafini.github.io/papers/zab.pdf
  • Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed, ZooKeeper: Wait-free coordination for Internet-scale systems, USENIX ATC 2010. Describes the system (znodes, sessions, watches, the wait-free API) that ZAB exists to support. Useful for understanding why ZAB was designed with primary order rather than as a generic consensus library. https://www.usenix.org/legacy/event/atc10/tech/full_papers/Hunt.pdf

Implementations to read alongside

Determinism and simulation

  • db-16's references on FoundationDB simulation testing and TigerBeetle apply verbatim here. The (delivery_time, sender, seq) heap and the splitmix64-seeded jitter are the same discipline.
  • The ZooKeeper test suite (zookeeper/src/java/test/.../quorum/) uses scripted scenarios but is not deterministic in the cross-language sense this lab aims for. Worth reading as an example of how the production team tests the algorithm.

Background reading worth doing

  • Heidi Howard, Distributed consensus revised, Cambridge PhD dissertation, 2019; the 2020 survey A Generalised Solution to Distributed Consensus unifies Paxos, Raft, and ZAB under a single quorum-intersection framework. Helps see ZAB as one point in a design space rather than as an oddball. https://www.cl.cam.ac.uk/~hh360/
  • Leslie Lamport, Paxos Made Simple, 2001. The contrast with ZAB is illuminating: Paxos picks a value per slot; ZAB streams a totally ordered log under a primary. https://lamport.azurewebsites.net/pubs/paxos-simple.pdf
  • Diego Ongaro and John Ousterhout, In Search of an Understandable Consensus Algorithm, USENIX ATC 2014 — the Raft paper. Read this before the ZAB papers if you have not already; the comparison in db-17's CONCEPTS.md is the recommended on-ramp. https://raft.github.io/raft.pdf
  • André Medeiros, ZooKeeper's Atomic Broadcast Protocol: Theory and Practice, Aalto University seminar notes, 2012. A 14-page treatment of ZAB-vs-implementation gotchas; useful when the papers feel terse. https://www.tcs.hut.fi/Studies/T-79.5001/reports/2012-deSouzaMedeiros.pdf

Cross-lab dependencies

  • Upstream:
    • db-16 — distributed-fundamentals: Lamport/VC and the deterministic simulator harness whose discipline this lab inherits wholesale.
    • db-17 — Raft: same simulator skeleton; reading Raft first makes ZAB's discovery/sync handshake feel like the explicit version of Raft's implicit AppendEntries consistency check.
    • db-18 — Paxos: the other consensus reference point; ZAB's (epoch, counter) is the streaming-log analog of Paxos's (ballot, slot) numbering.
  • Downstream:
    • db-20 — Distributed KV: wraps a consensus engine (could be Raft, ZAB, or Paxos from this track) around a key-value state machine.
    • db-21 — Storage-engine-advanced: snapshots and log compaction on top of the canonical history laid down here.
    • db-23 — Capstone: composes the simulator harness across multiple replicated shards.

db-19 — Analysis

Required invariants

  1. Election agreement. At most one node finishes a successful election cycle with role == Leading && synced per epoch. Enforced by majority voting in vote_view plus the strictly increasing pending_new_epoch = max(accepted_epoch, current_epoch) + 1 rule: any competing prospective leader sees a higher accepted_epoch and steps down (via NewEpoch rejection) before it can sync.

  2. Primary order. If a single primary broadcasts proposals p then q, every follower that delivers both delivers p before q. Enforced by the leader's monotonically increasing next_counter (no gaps, no reuse within an epoch) plus the follower's txn.zxid > last_zxid() gate on Propose (out-of-order proposals are silently dropped rather than re-ordered into the log).

  3. Integrity. The leader only proposes once it is synced, and followers only append once current_epoch has been adopted via NewLeader. Followers will not append a Propose whose zxid <= last_zxid(), so a stale leader's late Propose for an already-superseded epoch cannot corrupt a follower's history.

  4. Agreement on committed prefix. If a follower has last_committed = z, every other follower's history contains every txn with zxid <= z (because Commit(z) is only broadcast after a quorum has appended every txn up through z, and a future leader must include any quorum's committed prefix via the Discovery AckEpoch(last_zxid) reports, so it adopts the surviving history).

  5. Total order. All followers deliver committed transactions in the same order (the leader-assigned zxid order). This follows directly from primary order + agreement on committed prefix.

  6. Byte determinism. For every (seed, nodes, rounds, proposals, partition) tuple, the three binaries produce identical canonical_dump bytes — hence identical sha256 hex on stdout. scripts/cross_test.sh checks six scenarios.

Design decisions

  • propose() calls try_commit() at the end. Same single-node argument as db-17 Raft: a one-node cluster is its own majority, no Ack will ever arrive to drive the commit, so the leader must run the quorum check inline. Harmless for n > 1 because the majority check rejects until acks actually arrive.

  • Sorted iteration on every wire-affecting loop. Rust uses BTreeMap / BTreeSet; C++ uses std::map / std::set; Go uses explicit for p := uint32(0); p < n; p++ loops. The Go code also sorts before iterating wherever a map[uint32]... is read for output (the canonical dump and broadcast loops). HashMap would compile and pass single-language tests but fail cross_test.sh immediately.

  • LookForLeader is structurally a Vote for the sender. The Rust handle arm folds LookForLeader directly into the Vote arm. This avoids a separate LookForLeaderReply and gives a late-arriving Looking node the ability to tally an immediate self-vote from the source. The Go and C++ implementations do the same fold.

  • Non-Looking nodes reply to a Vote with their own current vote. An isolated Looking node sending a Vote to peers who are already Following gets back a Vote pointing at the live leader; combined with the lex-update rule, the isolated node converges on the existing leader in O(1) round-trips after partition heal, rather than starting a new election.

  • Vote lex comparison is (last_zxid, voter_id), not (last_zxid, leader_id). The voter's id is the tie-breaker when histories are equal — this is what makes the highest-id node win a cold-start election. Using leader_id instead would create a fixed-point where any node can vote for any leader and the tally never makes progress.

  • pending_new_epoch = max(accepted_epoch, current_epoch) + 1. The max covers the case where this node has previously acknowledged a NewEpoch for a leader that then failed before reaching sync. Without the max, the new leader could pick an epoch that some follower has already rejected, leaving sync stalled forever.

  • AckEpoch is idempotent. A follower that has already adopted accepted_epoch = e replies again on a re-sent NewEpoch(e). This keeps the discovery handshake robust against the heartbeat-driven re-send loop in on_tick while the leader is still gathering acks.

  • NewLeader ships the whole history, not a diff. Following the ZAB paper. For a study lab this is fine; production ZooKeeper uses SNAP / DIFF / TRUNC variants to avoid bulk transfer. Replacing this with a diff would be a substantial change to the RPC layer and is out of scope.

  • Heartbeats re-broadcast the last Commit. Once synced, the leader re-broadcasts Commit(last_committed) every 50 ticks. This doubles as the "leader is alive" signal that resets the follower's election timer. Sending a dedicated Heartbeat RPC would be one more wire variant for no behavioural gain.

  • Proposal schedule is closed-form. schedule[i] = (i+1) * rounds / (K+1) (integer division). Same rationale as db-17: decoupling proposal timing from cluster scheduling decisions keeps the dump bytes from depending on incidental tick alignment.

  • Library + thin CLI. The lab exposes Cluster::new, run, canonical_dump, and sha256 as a library. The CLI is a few dozen lines of arg parsing plus four function calls.

Tradeoffs worth flagging

  • No snapshots, no SNAP/DIFF/TRUNC. The leader sends the full history on every NewLeader. For the bounded rounds of this lab the cost is trivial; for production ZooKeeper it would be prohibitive on large datasets. Snapshots are deferred to db-21.

  • No client sessions, no znodes, no watches. ZAB exists to serve ZooKeeper, but this lab implements ZAB in isolation. The "state machine" is the history vector itself. Anything ZooKeeper-API-shaped (sessions, ephemerals, watches, ACLs) is downstream of the consensus core and lives in a different lab.

  • Crash semantics are stylized. Crashes are simulated only via the partition flag (drop all messages in one direction). A real ZooKeeper must handle persistent storage corruption, fsync ordering, and restart-mid-vote; the canonical dump pretends all state is durable by construction.

  • No Observer role. Production ZooKeeper has non-voting Observer servers that learn from the leader but do not participate in quorum. They are pure read-fanout and add no algorithmic complexity, so they were left out of the simulator.

  • No client-side dedup. A proposal injected into a leader who immediately loses leadership may be replicated, lost, and never re-proposed. The simulator's cluster_pending queue is drained unconditionally; we are testing the consensus core, not the client RPC layer.

  • Follower truncation is by replacement, not by prefix-match. When a follower receives NewLeader(e, hist), it adopts hist wholesale, even if its own history shares a prefix. This is correct (the leader's history is authoritative for the new epoch) but heavier than necessary; a real implementation would diff.

Why three languages

Same reasoning as db-16 and db-17, plus one new lesson specific to ZAB: the algorithm has three quorum trackers that are easy to get subtly wrong (epoch_acks for discovery, leader_acks for sync, and the per-zxid ack_set for broadcast). Each set must be iterated in a stable order for the dump, and each must include the leader's own id on initialization. The cross-language test catches both mistakes immediately: forgetting to add self.id to epoch_acks costs a tick of discovery time that perturbs every downstream delivery and changes the dump bytes.

db-19 — Execution

One-shot: prove the lab works

cd db-19-zab
./scripts/verify.sh        # all unit tests in Rust, Go, C++
./scripts/cross_test.sh    # byte-identical sha256 across all three, six scenarios

A green run of cross_test.sh ends with the literal line:

=== ALL OK ===

Per-language workflows

Rust

cd src/rust
cargo test --release       # ~10 tests
cargo build --release      # produces target/release/zabctl
./target/release/zabctl --seed 42 --nodes 3 --rounds 1000 --proposals 5

Go

cd src/go
go test ./...              # ~9 tests
go build -o /tmp/zabctl_go ./cmd/zabctl
/tmp/zabctl_go --seed 42 --nodes 3 --rounds 1000 --proposals 5

C++

cd src/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
ctest --test-dir build     # test_db19 — 10 assertions
./build/zabctl --seed 42 --nodes 3 --rounds 1000 --proposals 5

CLI

All three binaries accept the same flags and print lowercase hex sha256 of the canonical dump to stdout with no trailing newline:

| flag | default | meaning |
| --- | --- | --- |
| --seed N | 0 (Go) / 42 (Rust) / 0 (C++) | splitmix64 seed mixed into election timers and message delays |
| --nodes K | 3 | number of ZAB nodes (1 is legal; majority is then 1) |
| --rounds R | 0/1000 | number of simulator ticks to run |
| --proposals P | 0 | number of client commands to inject during the run |
| --partition s,d,... | none | comma-separated pairs (src, dst) to drop in that direction |

(Flag defaults drift between languages because the cross-test script always passes every flag explicitly. Only behavior under explicit flags is part of the cross-language contract.)

--partition 0,1,1,0 drops both directions between nodes 0 and 1 (complete split); --partition 0,1 drops only 0 → 1 (asymmetric). Proposals are spaced as schedule[i] = (i+1) * rounds / (K+1), where K in this formula is the --proposals value, not the node count; with --rounds 1000 --proposals 5 they fire at ticks 166, 333, 500, 666, 833 with payloads "zab-0" through "zab-4".
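
A Rust sketch of how the --partition argument could be turned into a set of dropped directed edges. The function names parse_partition and is_dropped are hypothetical, and rejecting an odd number of ids is an assumption about error handling that this document does not specify.

```rust
use std::collections::BTreeSet;

/// Parse "s,d,s,d,..." into a set of dropped directed edges (src, dst).
pub fn parse_partition(arg: &str) -> Result<BTreeSet<(u32, u32)>, String> {
    let ids: Vec<u32> = arg
        .split(',')
        .map(|tok| {
            tok.trim()
                .parse::<u32>()
                .map_err(|e| format!("bad id {:?}: {}", tok, e))
        })
        .collect::<Result<_, _>>()?;
    // Assumption: a dangling src without a dst is an error, not ignored.
    if ids.len() % 2 != 0 {
        return Err(format!("expected src,dst pairs, got {} ids", ids.len()));
    }
    Ok(ids.chunks(2).map(|p| (p[0], p[1])).collect())
}

/// True when a message src -> dst must be dropped. Direction matters:
/// (0, 1) in the set says nothing about traffic from 1 to 0.
pub fn is_dropped(drops: &BTreeSet<(u32, u32)>, src: u32, dst: u32) -> bool {
    drops.contains(&(src, dst))
}
```

Parsing "0,1,0,2,1,0,2,0" yields four edges that together cut node 0 off from nodes 1 and 2 in both directions, which is the isolation used by scenario E.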

Canonical scenarios

scripts/cross_test.sh runs six scenarios; their sha256s are listed in docs/observation.md. If any change, cross_test.sh will exit non-zero.

| label | args |
| --- | --- |
| A | --seed 42 --nodes 3 --rounds 1000 --proposals 5 |
| B | --seed 7 --nodes 5 --rounds 2000 --proposals 20 |
| C | --seed 99 --nodes 3 --rounds 500 --proposals 0 |
| D | --seed 1 --nodes 1 --rounds 200 --proposals 5 |
| E | --seed 42 --nodes 3 --rounds 1000 --proposals 3 --partition 0,1,0,2,1,0,2,0 |
| F | --seed 3 --nodes 5 --rounds 1500 --proposals 10 --partition 0,1 |

D exercises the single-node-leader code path that motivated the propose() → try_commit() call. E isolates node 0 completely; the other two must elect a leader and commit the remaining proposals (the surviving quorum's history is what ends up in the dumps of nodes 1 and 2). F is an asymmetric partition that causes epoch churn but leaves replication recoverable.

Sanity checks

# Pick any scenario and round-trip — the hash is content-defined.
./src/rust/target/release/zabctl --seed 42 --nodes 3 --rounds 1000 --proposals 5
# expect: 16af5aa6dbd5ce09b259755f3339d6cf23966ce115b0e30d9c2990487783047d

# Magic of the canonical dump (use the lib directly; the CLI hashes it):
#   - Rust:  TestDumpDeterministicAcrossRuns asserts da.starts_with("DSEZAB01").
#   - Go:    TestDumpDeterministicAndMagic   asserts the same.
#   - C++:   test_dump_deterministic_and_magic in tests/test_db19.cc.

Tunables (CONCEPTS.md cross-reference)

  • HEARTBEAT_INTERVAL = 50 — leader re-broadcasts last Commit every 50 ticks.
  • ELECTION_TIMEOUT_MIN = 150, ELECTION_TIMEOUT_SPAN = 150 — base + jitter for follower election deadline.
  • DELIVERY_DELAY_SPAN = 3 — message delivery delay is 1 + splitmix64(seed ^ src ^ dst ^ t) % 3 ticks.

Changing any of these changes every canonical hash. The intent is that the lab is a fixed-point study object: the values are part of the contract.

db-19 — Observation

What the cross-language test produces and how to read it by hand.

Expected sha256s

scripts/cross_test.sh runs six scenarios and asserts the three binaries (Rust, Go, C++) all print the same hex digest. The current canonical hashes are:

| label | args | sha256 |
| --- | --- | --- |
| A | --seed 42 --nodes 3 --rounds 1000 --proposals 5 | 16af5aa6dbd5ce09b259755f3339d6cf23966ce115b0e30d9c2990487783047d |
| B | --seed 7 --nodes 5 --rounds 2000 --proposals 20 | b60388e978a9b98792edb00c8d33217da8bff9945a89d2c0c18b5f69520b91cf |
| C | --seed 99 --nodes 3 --rounds 500 --proposals 0 | 8aef7604639fe0f2b349b38d74e10b6da8ac252b626976563bba69c722426296 |
| D | --seed 1 --nodes 1 --rounds 200 --proposals 5 | d4dbb92f91f9a0adf0c4c0b91fa46b2a5145907450897cd6473a02a6279604fd |
| E | --seed 42 --nodes 3 --rounds 1000 --proposals 3 --partition 0,1,0,2,1,0,2,0 | 5e4dbddb605e469c99fb682c00256445dcb2ed07e984f673d4296ef19719979a |
| F | --seed 3 --nodes 5 --rounds 1500 --proposals 10 --partition 0,1 | c9df583bd714534c488aac710e6cc6e57e4b21d2fe96ec17068bd1c7525bc1b3 |

If any of these change, cross_test.sh will fail. Either you have a bug, or you have intentionally changed the spec (timer constants, schedule formula, dump layout) and you must update this table in the same commit.

What the canonical dump looks like (scenario D — single node)

--seed 1 --nodes 1 --rounds 200 --proposals 5. Five proposals into a single-node cluster — the leader is itself the majority, so every proposal commits immediately and discovery/sync collapse to a no-op (quorum reached on self.id).

offset 0x00 :  44 53 45 5A 41 42 30 31    "DSEZAB01"        magic
offset 0x08 :  01 00 00 00                 1                 node_count
offset 0x0c :  00 00 00 00                 0                 node id
offset 0x10 :  02                          role = Leading (2)
offset 0x11 :  XX XX XX XX                 current_epoch     (== 1 if no churn)
offset 0x15 :  XX XX XX XX                 accepted_epoch    (== current_epoch)
offset 0x19 :  XX XX XX XX                 last_zxid.epoch   (== current_epoch)
offset 0x1d :  05 00 00 00                 last_zxid.counter = 5
offset 0x21 :  XX XX XX XX                 last_committed.epoch
offset 0x25 :  05 00 00 00                 last_committed.counter = 5
offset 0x29 :  05 00 00 00                 history_len = 5
offset 0x2d :  XX XX XX XX                 history[0].zxid.epoch
offset 0x31 :  01 00 00 00                 history[0].zxid.counter
offset 0x35 :  05 00 00 00                 history[0].payload_len = 5
offset 0x39 :  7A 61 62 2D 30              "zab-0"           payload
...

Each history entry is 4 + 4 + 4 + 5 = 17 bytes (epoch + counter + len + "zab-N"). The fixed prefix ends at 0x2d, so the total dump for D is 0x2d + 5 * 17 = 45 + 85 = 130 bytes (0x82). Exact bytes depend on whatever epoch the leader has bumped through by the time the run ends; the single-node case is nearly always current_epoch = 1.

A multi-node dump (scenario C — quiet cluster)

--seed 99 --nodes 3 --rounds 500 --proposals 0. No proposals; the cluster elects a leader, runs through discovery + sync, then heartbeats for the rest of the run. Every node's history is empty:

44 53 45 5A 41 42 30 31         magic
03 00 00 00                     node_count = 3

00 00 00 00                     node id 0
XX                              role            (Following or Leading)
XX XX XX XX                     current_epoch   (1 if first election succeeded clean)
XX XX XX XX                     accepted_epoch
00 00 00 00 00 00 00 00         last_zxid       (0, 0)
00 00 00 00 00 00 00 00         last_committed  (0, 0)
00 00 00 00                     history_len = 0

01 00 00 00                     node id 1
... same shape ...

02 00 00 00                     node id 2
... same shape ...

Total dump: 8 + 4 + 3 * (4 + 1 + 4 + 4 + 8 + 8 + 4) = 12 + 3 * 33 = 111 bytes. (33 bytes per node with empty history.)
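
The size arithmetic for both layouts can be checked mechanically. This is a small sketch that derives a dump length purely from the field widths shown in the layouts above; expected_dump_len and the constant names are hypothetical, introduced only for illustration.

```rust
/// Byte widths taken from the dump layout: header, fixed per-node prefix,
/// then one variable-length record per history entry.
const HEADER_LEN: usize = 8 + 4; // magic + node_count
const NODE_FIXED_LEN: usize = 4 + 1 + 4 + 4 + 8 + 8 + 4; // id, role, 2 epochs, 2 zxids, history_len
const ENTRY_FIXED_LEN: usize = 4 + 4 + 4; // zxid.epoch, zxid.counter, payload_len

/// `payload_lens[i]` lists the payload lengths of node i's history entries.
pub fn expected_dump_len(payload_lens: &[Vec<usize>]) -> usize {
    let nodes: usize = payload_lens
        .iter()
        .map(|h| NODE_FIXED_LEN + h.iter().map(|p| ENTRY_FIXED_LEN + p).sum::<usize>())
        .sum();
    HEADER_LEN + nodes
}
```

For scenario D the input is one node with five 5-byte payloads; for scenario C it is three nodes with empty histories. Comparing the result against the length of a raw dump is a quick first check before reading individual bytes.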

How to debug a divergence

If cross_test.sh fails, write the raw dumps to disk (the CLI prints only the hash; you'll need a one-liner that calls canonical_dump directly, or modify zabctl.rs / main.go / zabctl.cc to dump the raw bytes instead of the hash). Then:

cmp -l /tmp/zab_A_rust.bin /tmp/zab_A_go.bin | head
xxd /tmp/zab_A_rust.bin | sed -n '<line>,+2p'
xxd /tmp/zab_A_go.bin   | sed -n '<line>,+2p'

The first divergence offset tells you what to look at:

| offset range | likely culprit |
|---|---|
| 0x00–0x07 | magic (typo: DSEZAB01 not DSEZAB1 or DSEZAB02) |
| 0x08–0x0b | node_count (impossible if all three accept --nodes correctly) |
| inside a node block, on role | enum mapping (Looking=0, Following=1, Leading=2) |
| inside a node block, on current_epoch / accepted_epoch | discovery handshake bug; the leader's pending_new_epoch likely didn't max() against current_epoch |
| inside a node block, on last_zxid | counter reset on epoch change wrong (must reset to 0; first new proposal has counter 1) |
| inside a node block, on last_committed | try_commit quorum count wrong, or propose() not calling try_commit (n=1 case) |
| inside history_len | follower Propose filter wrong (out-of-order zxid not dropped), or NewLeader replacement not adopting leader's history |
| inside a history entry | broadcast loop iteration order; must be ascending peer id |

In all six existing scenarios these checks pass; the table above is the runbook for the day someone changes the algorithm and forgets to update one of the three implementations.
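
If you prefer to locate the first divergence from a test rather than with cmp, a sketch along these lines would do. Both function names are hypothetical; region only labels the fixed header, since classifying an offset inside a node block requires walking the variable-length history records.

```rust
/// Offset of the first differing byte, or the length of the shorter dump
/// when one is a strict prefix of the other. `None` means identical.
pub fn first_divergence(a: &[u8], b: &[u8]) -> Option<usize> {
    match a.iter().zip(b.iter()).position(|(x, y)| x != y) {
        Some(i) => Some(i),
        None if a.len() != b.len() => Some(a.len().min(b.len())),
        None => None,
    }
}

/// Coarse label for the fixed header; later offsets fall in some node block.
pub fn region(offset: usize) -> &'static str {
    match offset {
        0x00..=0x07 => "magic",
        0x08..=0x0b => "node_count",
        _ => "node block",
    }
}
```

Note that the offset returned here is 0-based, whereas cmp -l reports 1-based byte numbers.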

Tick-level scope (Rust REPL trick)

To watch a scenario from the inside, add this temporary print in Cluster::run at the top of the per-tick loop body:

// Temporary tick trace. Assumes `t` is the loop's tick counter and that
// Role implements Copy and Debug.
if std::env::var("ZAB_TRACE").is_ok() {
    eprintln!(
        "t={} roles={:?} epochs={:?} commits={:?}",
        t,
        self.nodes.iter().map(|n| n.role).collect::<Vec<_>>(),
        self.nodes.iter().map(|n| n.current_epoch).collect::<Vec<_>>(),
        self.nodes.iter().map(|n| n.last_committed.counter).collect::<Vec<_>>(),
    );
}

then run ZAB_TRACE=1 zabctl --seed 42 --nodes 3 --rounds 1000 --proposals 5 2>&1 | head -50. The trace goes to stderr, so the 2>&1 redirect is what lets head see it; without the redirect, stdout still carries only the canonical dump's sha256, unchanged. Remove the print before committing.

Reading the hashes themselves

The hashes are arbitrary — they are SHA-256 of a binary blob whose bytes encode every node's state at the end of the run. There is no way to look at 16af5aa6... and infer anything about the cluster. What matters is that the same input produces the same output in three languages and that the table above doesn't drift unintentionally.

For human-readable insight, dump canonical_dump(&c) to a file and run xxd over it, or print individual node states in a test rather than at the CLI surface.

db-19 — Verification

Prerequisites

  • Rust ≥ 1.74 with cargo on PATH.
  • Go ≥ 1.22 (module declares go 1.22).
  • CMake ≥ 3.20 and a C++17 compiler (Apple clang ≥ 14, gcc ≥ 11).
  • A POSIX sha256sum is not required — each binary computes its own sha256 in-process.

One command

cd db-19-zab
bash scripts/verify.sh && bash scripts/cross_test.sh

Green is === db-19 :: ALL UNIT TESTS GREEN === followed by === ALL OK ===. Anything else is a regression.

What verify.sh does

  1. Rust: cargo test --release --quiet over src/rust/. Builds db19 lib + zabctl binary; runs the inline tests in src/rust/src/lib.rs. Expected: test result: ok.
  2. Go: go test ./... over src/go/. Builds the db19 package + cmd/zabctl; runs src/go/zab_test.go. Expected: PASS and ok github.com/10xdev/dse/db19.
  3. C++: cmake -DCMAKE_BUILD_TYPE=Release -B build, cmake --build build --target test_db19 zabctl, then ctest --test-dir build --output-on-failure. The test binary ends with Test #1: test_db19 ........ Passed.

If any of these three blocks fails, the script exits non-zero and the rest does not run.

What cross_test.sh does

For each of the six canonical scenarios (A–F) it invokes the three release zabctl binaries with identical flags, captures the lowercase-hex sha256 of the canonical cluster dump, and asserts rust == go == cpp byte-for-byte. The scenarios are:

| label | args | what it exercises |
|---|---|---|
| A | --seed 42 --nodes 3 --rounds 1000 --proposals 5 | basic 3-node, 5 proposals, clean network |
| B | --seed 7 --nodes 5 --rounds 2000 --proposals 20 | bigger cluster, longer horizon |
| C | --seed 99 --nodes 3 --rounds 500 --proposals 0 | election convergence only |
| D | --seed 1 --nodes 1 --rounds 200 --proposals 5 | degenerate single node (instant leader) |
| E | --seed 42 --nodes 3 --rounds 1000 --proposals 3 --partition 0,1,0,2,1,0,2,0 | 3-node with churn |
| F | --seed 3 --nodes 5 --rounds 1500 --proposals 10 --partition 0,1 | 5-node, asymmetric one-way drop |

Canonical hashes are listed in docs/observation.md. The script asserts consistency among the three ports; it is docs/observation.md that pins them to the historical fingerprint.

What green guarantees

  1. Determinism. Same flags ⇒ same canonical dump bytes across languages and runs (modulo endianness — all targets are little-endian). The simulator advances in integer ticks; all map/set iteration is over BTreeMap / sorted Go slices / std::map so the dump order is fixed.
  2. Safety in the modeled environment. No two nodes commit different histories. For every scenario in the suite, after the final tick:
    • last_committed.epoch is monotonic per node.
    • Where two nodes' history overlap by zxid, the bytes match.
    • No follower has committed past last_committed reported by the leader of its current epoch.
  3. Liveness in the modeled environment. Scenarios A, B, D, F include proposals and run long enough to elect a leader and commit them. Scenarios C and E confirm we don't commit what we shouldn't (C has zero proposals; E partitions away the would-be leader so the alternative path must take over).
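
The second safety bullet (overlapping histories must carry identical bytes) is easy to check after a run. A minimal sketch, assuming each node's history is available as a Vec of (zxid, payload) records; first_conflict is a hypothetical helper and the types are redeclared here only to keep the block self-contained.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct ZxId { pub epoch: u32, pub counter: u32 }

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Txn { pub zxid: ZxId, pub payload: Vec<u8> }

/// Returns the first zxid (in visit order) whose payload differs between
/// two histories, or `None` when every shared zxid carries identical bytes.
pub fn first_conflict(histories: &[Vec<Txn>]) -> Option<ZxId> {
    let mut seen: BTreeMap<ZxId, &[u8]> = BTreeMap::new();
    for history in histories {
        for txn in history {
            // Remember the first payload seen for this zxid, then compare.
            let first = seen.entry(txn.zxid).or_insert(txn.payload.as_slice());
            if *first != txn.payload.as_slice() {
                return Some(txn.zxid);
            }
        }
    }
    None
}
```

A history that is a strict prefix of another passes this check; only a shared zxid with different bytes is reported.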

What green does not guarantee

  • Behavior outside the canonical scenarios. ZAB's state space is large; six fingerprints are an acceptance test, not a model checker. Real validation needs TLA+ (see references.md).
  • Performance. No latency or throughput is measured. Tick count is simulation cost, not wall-clock SLA.
  • Snapshotting / log compaction. Histories grow unboundedly; ZooKeeper truncates via snapshots, which is out of scope here.
  • Production safety primitives — fsync barriers, on-disk checksums, recovery from torn writes, byzantine actors. All deliberately deferred.
  • Real network. Partitions are modeled as a BTreeSet of one-way drops applied at delivery; reordering happens through the simulator's priority queue, not a Lossy/OOO network. There is no actual socket.

Invariant assertions in code

The implementations carry inline assertions where they are nearly free. The load-bearing ones:

| Where | Assertion | What it catches |
|---|---|---|
| Leader on_ack | refuse acks for zxids not in our outstanding set | duplicate / replayed acks inflating quorum |
| update_vote (election) | only adopt votes with greater (last_zxid, id) | non-monotone vote drift |
| handle_new_epoch | followers must reply only if new_epoch > accepted_epoch | accepting a stale epoch from a deposed leader |
| handle_new_leader | followers replace history only if new_epoch > current_epoch | losing already-committed entries |
| canonical_dump writer (all 3 langs) | nodes in ascending id, per-node history in ascending zxid | dump-writer drift between languages |

The unit tests assert each of these on at least one path.
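
As an illustration of the first row, here is a minimal sketch of leader-side ack tracking that ignores acks for zxids outside the outstanding set and reports a commit exactly once. AckTracker and its method signatures are hypothetical; only the BTreeMap<ZxId, BTreeSet<u32>> shape and the propose-calls-try_commit behaviour come from the lab text.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct ZxId { pub epoch: u32, pub counter: u32 }

pub struct AckTracker {
    quorum: usize,
    outstanding: BTreeMap<ZxId, BTreeSet<u32>>,
}

impl AckTracker {
    pub fn new(n_nodes: usize) -> Self {
        AckTracker { quorum: n_nodes / 2 + 1, outstanding: BTreeMap::new() }
    }

    /// Register a proposal; the leader's own id is the first ack.
    /// Returns true when the leader alone is a quorum (the n = 1 case).
    pub fn propose(&mut self, zxid: ZxId, leader_id: u32) -> bool {
        self.outstanding.insert(zxid, BTreeSet::from([leader_id]));
        self.try_commit(zxid)
    }

    /// Record an ack. Returns true only on the ack that reaches quorum.
    pub fn on_ack(&mut self, zxid: ZxId, from: u32) -> bool {
        match self.outstanding.get_mut(&zxid) {
            Some(voters) => { voters.insert(from); } // a set, so duplicates are harmless
            None => return false, // unknown or already committed zxid: ignore
        }
        self.try_commit(zxid)
    }

    fn try_commit(&mut self, zxid: ZxId) -> bool {
        let quorum = self.quorum;
        let reached = self.outstanding.get(&zxid).map_or(false, |v| v.len() >= quorum);
        if reached {
            self.outstanding.remove(&zxid); // later acks for this zxid are refused
        }
        reached
    }
}
```

Removing the entry on commit is the design choice that makes late and replayed acks fall into the "not outstanding" branch instead of re-triggering a commit.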

db-19 — Broader Ideas

The lab implements textbook ZAB (epoch + counter, leader-driven broadcast, discovery + sync on leadership change) with a deterministic simulator and three-language cross-validation. It deliberately stops where production engineering begins. This document collects the threads worth pulling on next.

Variants and refinements

ZAB-with-snapshots

Production ZooKeeper periodically truncates history by snapshotting the in-memory state machine and dropping txns whose zxid is below the snapshot's. Followers that fall behind the leader's snapshot are fast-forwarded with SNAP (whole-state copy) rather than DIFF (replay tail). Worth implementing as db-19b — it reuses the wire format and adds a Snap { zxid, state_bytes } RPC alongside the existing NewLeader payload.

Fast Leader Election (production form)

Real ZooKeeper's FLE has tie-breaking by peer epoch (the highest epoch this voter has ever seen) before falling back to (last_zxid, id). The lab uses just (last_zxid, id) which is enough for safety but loses an optimization: a node that just lost leadership often still has the highest peer-epoch and should regain leadership quickly. Worth a db-19c.

Observer mode

Observers receive Commit but never vote in elections or quorums. ZooKeeper added them at scale to push read traffic past the voter-set throughput ceiling without inflating quorum sizes. The simulator extension is small: add a Role::Observer, exclude it from quorum counts, still deliver every Commit.

Read-only mode (RO clients during partition)

When a quorum dies but some nodes remain, ZooKeeper exposes those survivors in a read-only mode that serves last-known committed state. This is a useful failure-mode case for the simulator: drop into RO when no quorum responds within an election cycle.

Cross-epoch zxid ordering

Production ZAB stuffs (epoch, counter) into one u64 (32 bits each). The lab uses a struct for clarity; switching to the packed form is a one-line change and would let zxid live in atomic operations on real hardware. Worth a benchmark in db-22.
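
A minimal sketch of the packed form, assuming the epoch occupies the high 32 bits so that plain u64 comparison agrees with the struct's epoch-first ordering. The pack, unpack and advance names are hypothetical.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct ZxId { pub epoch: u32, pub counter: u32 }

impl ZxId {
    /// Epoch in the high 32 bits, counter in the low 32 bits.
    pub fn pack(self) -> u64 {
        (u64::from(self.epoch) << 32) | u64::from(self.counter)
    }

    pub fn unpack(raw: u64) -> Self {
        ZxId { epoch: (raw >> 32) as u32, counter: raw as u32 }
    }
}

/// Advance a shared high-water mark to at least `to`.
/// Returns the value held before the call.
pub fn advance(mark: &AtomicU64, to: ZxId) -> ZxId {
    ZxId::unpack(mark.fetch_max(to.pack(), Ordering::AcqRel))
}
```

With this layout a single fetch_max is enough to implement last_committed = max(last_committed, zxid) without a lock, which is the kind of operation a db-22 benchmark could compare against the struct form.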

Production systems to study

Apache ZooKeeper

The canonical implementation. Read the original ZAB paper (Junqueira, Reed, Serafini — Zab: High-performance broadcast for primary-backup systems, DSN 2011) alongside the source in org.apache.zookeeper.server.quorum. The simulator in this lab maps directly onto Leader.java, Follower.java, and FastLeaderElection.java.

Kafka KRaft (Raft replacement for ZooKeeper)

Confluent's argument against keeping ZooKeeper as a dependency was operational: two consensus systems (ZAB for metadata, Kafka's own ISR for log replication) doubled the failure-modes and runbooks. KIP-500 replaces ZAB with a Raft-style log inside Kafka itself. A good real-world counterpoint to read alongside db-17 (Raft).

Curator / Recipes

Apache Curator's "recipes" (locks, leader latches, distributed queues) are layered on top of ZooKeeper. They are a clinic in how to avoid misusing a primary-order primitive: every recipe pins its watch semantics and retry policy explicitly, because ZooKeeper's ephemeral nodes are not ACID transactions.

Etcd v2 vs v3

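etcd has replicated its log with Raft since its early releases, so the move from v2 to v3 is not a ZAB-to-Raft migration. What changed is the layer above consensus: v2 exposed a hierarchical key tree over HTTP/JSON, while v3 moved to a flat, multi-version (MVCC) keyspace over gRPC. Comparing the two is instructive for exactly that reason: same replicated log, different state machine.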

Chubby (Google)

Chubby is Multi-Paxos-based, not ZAB, but the lease + session model in ZooKeeper traces directly back to Chubby. Burrows's OSDI 2006 paper is the canonical writeup; read it after this lab and before db-23.

Performance experiments worth running

The simulator's ticks are a unit of cost for comparative experiments:

  • Quorum-size sweep. For nodes ∈ {3, 5, 7, 9}, run proposals = 50 and count ticks to commit the last proposal. Expected: commit cost rises slowly with quorum size (one extra round-trip per added node), election cost rises sharply (vote table doubles).
  • Discovery+sync cost on leadership churn. Vary the partition schedule's --partition density. The lab's E scenario has 4 churn events in 1000 ticks; the more churn, the higher the ratio of NewEpoch/NewLeader bytes to Propose/Commit bytes in the dump. Plot that ratio.
  • Comparison to Raft (db-17) and Paxos (db-18). Same flag surface (--seed --nodes --rounds --proposals --partition) and same scenarios — lab structure is identical on purpose. Compare scenario-A commit latency across the three protocols.

What "production-quality" would require beyond this lab

  • Durable storage. history, current_epoch, accepted_epoch must survive kill -9 and power loss. Real ZooKeeper writes a WAL (see db-03) and snapshots every N transactions.
  • Real network. Sockets, TCP retransmits, framing, TLS, auth. The simulator's OutMsg collapses all of that.
  • Client sessions. ZooKeeper's session-id ↔ ephemeral-node binding is a major protocol surface in its own right; not modeled here.
  • Watches. The pub/sub layer on top of read-paths. Adds a fan-out table and a per-session notification queue.
  • Cluster reconfiguration. Adding/removing voters safely is its own protocol (joint quorum on the membership txn). Out of scope.
  • Recovery from torn writes. Per-page checksums on the WAL.
  • Adversarial inputs. ZAB assumes crash-stop failures only. Tolerating Byzantine faults needs a different protocol family (BFT-SMaRt is one example) and a separate code base entirely.

db-19 step 01 — Epoch, zxid, and Fast Leader Election

Goal

Build the persistent state every ZAB node carries and the election protocol that picks the next leader when no one is currently broadcasting. Election must converge in bounded ticks for any quorum-available network, and the chosen leader must always be the node with the highest (last_zxid, id) among the voters in the surviving quorum.

Tasks

  1. ZxId { epoch: u32, counter: u32 } with lexicographic ordering (epoch first, then counter). Provide a ZxId::ZERO constant. Every zxid comparison in the rest of the protocol uses this ordering — never compare the u64 representations directly, because the lab keeps them as a struct for clarity.

  2. Persistent node state. A ZabNode carries:

    • id: u32, n_nodes: u32, quorum = n_nodes/2 + 1.
    • role: Role (Looking / Following / Leading).
    • current_epoch: u32 — the epoch of the leader we last followed. Bumped on NewLeader.
    • accepted_epoch: u32 — the epoch we promised on NewEpoch. Always >= current_epoch.
    • history: Vec<Txn> — committed and uncommitted txns in zxid order.
    • last_committed: ZxId — high-water mark; entries <= this have been applied.

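    Of these, current_epoch, accepted_epoch, history, and last_committed are the four values that would survive a crash in a real implementation; id and n_nodes are static configuration. Everything else (role, vote tables, ack tables) is transient and rebuilt on the next election.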

  3. Rpc::LookForLeader / Vote. A Looking node broadcasts its current vote each tick. On receiving a peer's vote, update via update_vote(peer.last_zxid, peer.id):

    • Adopt peer as our vote target if (peer.last_zxid, peer.id) > (current_vote.last_zxid, current_vote.id).
    • Record the peer's choice in vote_view[voter_id] = leader_id.
  4. Quorum detection. Walk vote_view and count entries whose value equals each candidate id. The first candidate (in id order) whose count >= quorum becomes the elected leader. If that leader is us, transition to Leading; otherwise transition to Following with leader_id = Some(...).

  5. Election timeout. Track election_deadline per node. If now > election_deadline and we're still Looking, reset the vote table and broadcast a fresh LookForLeader from our current (last_zxid, id). Reseed the deadline with now + ELECTION_TIMEOUT_MIN + splitmix64(seed) % ELECTION_TIMEOUT_SPAN, so that it stays comparable with now.
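
Tasks 1, 3 and 4 can be sketched together as below. This is an illustrative reading of the task list, not the lab's reference code: the Election struct, record_choice, target and elected are hypothetical names, and the sketch assumes a node's own vote is stored in vote_view alongside its peers' votes.

```rust
use std::collections::BTreeMap;

// Field order matters: derive(Ord) compares epoch first, then counter.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct ZxId { pub epoch: u32, pub counter: u32 }

impl ZxId {
    pub const ZERO: ZxId = ZxId { epoch: 0, counter: 0 };
}

pub struct Election {
    self_id: u32,
    quorum: usize,
    current_vote: (ZxId, u32),     // (last_zxid, id) of our vote target
    vote_view: BTreeMap<u32, u32>, // voter_id -> leader_id
}

impl Election {
    pub fn new(self_id: u32, last_zxid: ZxId, n_nodes: usize) -> Self {
        Election {
            self_id,
            quorum: n_nodes / 2 + 1,
            current_vote: (last_zxid, self_id), // start by voting for ourselves
            vote_view: BTreeMap::from([(self_id, self_id)]),
        }
    }

    /// Adopt the candidate only if it is strictly greater by (last_zxid, id).
    pub fn update_vote(&mut self, last_zxid: ZxId, id: u32) -> bool {
        let adopt = (last_zxid, id) > self.current_vote;
        if adopt {
            self.current_vote = (last_zxid, id);
            self.vote_view.insert(self.self_id, id);
        }
        adopt
    }

    /// Record which leader a peer is currently voting for.
    pub fn record_choice(&mut self, voter_id: u32, leader_id: u32) {
        self.vote_view.insert(voter_id, leader_id);
    }

    pub fn target(&self) -> u32 {
        self.current_vote.1
    }

    /// First candidate in ascending id order that a quorum of voters backs.
    pub fn elected(&self) -> Option<u32> {
        let mut counts: BTreeMap<u32, usize> = BTreeMap::new();
        for &leader in self.vote_view.values() {
            *counts.entry(leader).or_insert(0) += 1;
        }
        counts.into_iter().find(|&(_, n)| n >= self.quorum).map(|(id, _)| id)
    }
}
```

Tuple comparison gives the (last_zxid, id) rule for free: the zxid decides first, and the id only breaks ties.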

Acceptance

Inline unit tests in each language. Names below are the Rust form; Go uses TestZxIdOrdering style, C++ uses test_zxid_ordering:

  • zxid_ordering_is_lexicographic: ZxId{0,9} < ZxId{1,0}, ZxId{1,0} < ZxId{1,1}, ZERO < ZxId{0,1}. Locks the comparator.
  • vote_adopts_higher_last_zxid — node 0 with last_zxid=(1,5) votes for itself; receives a vote from node 2 with (2,0); adopts node 2. Then receives from node 1 with (2,0): does not re-adopt (tie on zxid, lower id loses).
  • quorum_of_votes_elects_highest — in a 3-node cluster all voting for node 2, node 2 transitions to Leading after the second matching Vote arrives.
  • election_does_not_decide_in_minority — partition isolates node 0 from {1,2}; node 0 must never leave Looking regardless of how many ticks elapse.

All four green in Rust, Go, and C++.

db-19 step 02 — Discovery, sync, and atomic broadcast

Goal

Layer the steady-state ZAB protocol on top of the elected leader from step 01. The leader must bring every follower's history up to its own before accepting new proposals; once synced, the leader assigns a monotone zxid to each payload and commits it on majority ack. The dump bytes must match across Rust, Go, and C++.

Tasks

  1. Discovery (NewEpoch / AckEpoch). The fresh leader picks new_epoch = max(self.accepted_epoch, max-peer-accepted) + 1 and broadcasts NewEpoch { new_epoch }. Each follower:

    • Asserts new_epoch > accepted_epoch (refuse stale leaders).
    • Sets accepted_epoch = new_epoch.
    • Replies AckEpoch { current_epoch, last_zxid } — the follower's own committed epoch + tail of its history.

    The leader waits for a quorum of AckEpoch (counting itself). At quorum, it knows the highest zxid that any majority node has committed; that becomes the new initial history.

  2. Sync (NewLeader / AckLeader). Leader broadcasts NewLeader { new_epoch, history } where history is the leader's own log (which must include everything any quorum member has acked, by the contract of step 1). Each follower:

    • Asserts new_epoch > current_epoch (refuse historical leaders).
    • Replaces history with the leader's payload.
    • Sets current_epoch = new_epoch.
    • Replies AckLeader { new_epoch }.

    On quorum of AckLeader, the leader broadcasts Commit for every zxid in the synced history that has not yet been committed. The cluster is now in steady state.

  3. Broadcast (Propose / Ack / Commit). Each step tick, if there are queued proposals and we are the leader:

    • Assign zxid = ZxId { epoch: current_epoch, counter: ++last_counter }.
    • Append Txn { zxid, payload } to local history.
    • Broadcast Propose { txn } to all followers.

    Each follower asserts txn.zxid.epoch == current_epoch and txn.zxid > history.last().zxid, then appends and replies Ack { zxid }. The leader tracks acks per zxid in a BTreeMap<ZxId, BTreeSet<u32>>. On quorum (counting itself), it broadcasts Commit { zxid } and advances last_committed.

  4. Apply on commit. Followers receiving Commit { zxid } advance last_committed = max(last_committed, zxid) and (in a real system) apply the txn to the state machine. The lab leaves the state machine implicit — last_committed and history are the only observable surface.

  5. Canonical dump. dump_cluster(nodes) = magic("DSEZAB01") || u32 node_count || dump_node(0) || dump_node(1) || ... where dump_node = id u32 || role u8 || current_epoch u32 || accepted_epoch u32 || last_zxid (epoch,counter) || last_committed (epoch,counter) || history_len u32 || [zxid, payload_len u32, payload bytes] * history_len. All integers little-endian. hash = lowercase hex sha256 of the full byte string, no trailing newline.
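
The dump layout in task 5 translates almost line for line into a byte writer. The sketch below follows the field order given above; the NodeState struct and the put_* helpers are hypothetical, and the sha256 step is left out because the Rust standard library has no hash for it.

```rust
#[derive(Clone, Copy)]
pub struct ZxId { pub epoch: u32, pub counter: u32 }

pub struct Txn { pub zxid: ZxId, pub payload: Vec<u8> }

pub struct NodeState {
    pub id: u32,
    pub role: u8, // Looking = 0, Following = 1, Leading = 2
    pub current_epoch: u32,
    pub accepted_epoch: u32,
    pub last_zxid: ZxId,
    pub last_committed: ZxId,
    pub history: Vec<Txn>,
}

fn put_u32(out: &mut Vec<u8>, v: u32) {
    out.extend_from_slice(&v.to_le_bytes()); // explicit little-endian
}

fn put_zxid(out: &mut Vec<u8>, z: ZxId) {
    put_u32(out, z.epoch);
    put_u32(out, z.counter);
}

/// Serialize the cluster. `nodes` must already be in ascending id order,
/// and each history in ascending zxid order.
pub fn dump_cluster(nodes: &[NodeState]) -> Vec<u8> {
    let mut out = b"DSEZAB01".to_vec();
    put_u32(&mut out, nodes.len() as u32);
    for n in nodes {
        put_u32(&mut out, n.id);
        out.push(n.role);
        put_u32(&mut out, n.current_epoch);
        put_u32(&mut out, n.accepted_epoch);
        put_zxid(&mut out, n.last_zxid);
        put_zxid(&mut out, n.last_committed);
        put_u32(&mut out, n.history.len() as u32);
        for t in &n.history {
            put_zxid(&mut out, t.zxid);
            put_u32(&mut out, t.payload.len() as u32);
            out.extend_from_slice(&t.payload);
        }
    }
    out
}
```

Writing integers through to_le_bytes rather than copying in-memory representations keeps the output independent of the host's byte order.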

Acceptance

Inline unit tests in each language:

  • discovery_bumps_accepted_epoch — leader elected at accepted_epoch=0 broadcasts NewEpoch{1}; followers reach accepted_epoch=1.
  • sync_replaces_follower_history_with_leader_history — follower with stale history receives NewLeader { history: leader_log } and ends with history == leader_log.
  • propose_commits_on_quorum_ack — leader in 3-node cluster proposes one txn; commits after 1 follower ack (leader + 1 = 2 = quorum). The second follower's late ack does not double-commit.
  • propose_does_not_commit_without_quorum — leader in 5-node cluster proposes, 1 follower acks; last_committed stays at ZxId::ZERO.
  • zxid_counter_is_monotone_per_epoch — three proposals get counter 1, 2, 3; if the leader's epoch bumps (next election), counter resets to 1 under the new epoch.
  • canonical_dump_is_byte_stable — same input scenario → same dump → same sha256 across two calls in the same process.

All six green in Rust, Go, and C++.

db-19 step 03 — Cross-language determinism

Goal

Lock the byte-level output of all three implementations (Rust, Go, C++) to the same sha256 for every canonical scenario in scripts/cross_test.sh. This is the difference between "ZAB works in my language" and "ZAB is this exact state machine".

Tasks

  1. Deterministic RNG. splitmix64(u64) -> u64 per the spec:

    x  += 0x9E3779B97F4A7C15
    z   = (x ^ (x >> 30)) * 0xBF58476D1CE4E7B5
    z   = (z ^ (z >> 27)) * 0x94D049BB133111EB
    out =  z ^ (z >> 31)
    

    Every random choice in the simulator (election timeout, delivery delay, partition schedule index) consumes one splitmix64 call on a per-node counter. No language may use its own rand or math/rand or <random> defaults.

  2. Stable iteration. Every map iteration in election, ack tracking, and dump emission is over BTreeMap (Rust), std::map (C++), or a sorted []uint32 (Go). No HashMap / unordered_map / map[uint32] may appear in any code path that affects bytes-on-the-wire or bytes-in-the-dump.

  3. Delivery order. OutMsgs enqueued in the same tick are delivered in ascending source-id order, and in FIFO order among messages from the same source. Implement with a BinaryHeap<(deliver_at, src_id, seq_no, msg)> (Rust) and the equivalent in Go (container/heap with the same key) and C++ (std::priority_queue). The seq_no tie-breaks messages that share the same deliver_at and src_id.

  4. Partition modelling. --partition a,b,c,d,... is a list of (src, dst) one-way drops. Store as a BTreeSet<(u32, u32)>. At delivery, drop the message if (src, dst) ∈ partition_set. Symmetric partitions are expressed as 0,1,1,0. Single-arg list length must be even (no half-drop); reject odd-length input with exit code 2.

  5. zabctl CLI surface. All three binaries accept:

    zabctl --seed <u64> --nodes <u32> --rounds <u32> --proposals <u32> [--partition a,b,c,d,...]
    

    Print the lowercase-hex sha256 of dump_cluster(...) with no trailing newline. Exit code 2 on any bad flag.

  6. Wire-format magic. First 8 bytes of the dump are the ASCII string "DSEZAB01". Bump to "DSEZAB02" if the layout ever changes (and update docs/observation.md in the same commit).
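
Task 3 has one Rust-specific trap: std's BinaryHeap is a max-heap, so the key has to be reversed to pop the earliest message first. The sketch below shows one way to do that while applying the one-way drops of task 4 at delivery time. Network, Pending, send and deliver are hypothetical names; ordering by key only (so the message type needs no Ord) is a design choice of this sketch.

```rust
use std::cmp::Ordering;
use std::collections::{BTreeSet, BinaryHeap};

/// Queue key: (deliver_at, src_id, seq_no).
type Key = (u64, u32, u64);

struct Pending<M> { key: Key, dst: u32, msg: M }

// Order by key only, reversed, so the smallest key is popped first.
impl<M> PartialEq for Pending<M> {
    fn eq(&self, other: &Self) -> bool { self.key == other.key }
}
impl<M> Eq for Pending<M> {}
impl<M> PartialOrd for Pending<M> {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> { Some(self.cmp(other)) }
}
impl<M> Ord for Pending<M> {
    fn cmp(&self, other: &Self) -> Ordering { other.key.cmp(&self.key) }
}

pub struct Network<M> {
    queue: BinaryHeap<Pending<M>>,
    drops: BTreeSet<(u32, u32)>, // one-way (src, dst) drops
    next_seq: u64,
}

impl<M> Network<M> {
    pub fn new(drops: BTreeSet<(u32, u32)>) -> Self {
        Network { queue: BinaryHeap::new(), drops, next_seq: 0 }
    }

    pub fn send(&mut self, deliver_at: u64, src: u32, dst: u32, msg: M) {
        let key = (deliver_at, src, self.next_seq);
        self.next_seq += 1; // unique, so keys never tie
        self.queue.push(Pending { key, dst, msg });
    }

    /// Pop everything due at or before `now` as (src, dst, msg),
    /// discarding messages on partitioned links.
    pub fn deliver(&mut self, now: u64) -> Vec<(u32, u32, M)> {
        let mut out = Vec::new();
        while self.queue.peek().map_or(false, |p| p.key.0 <= now) {
            let p = self.queue.pop().unwrap();
            if !self.drops.contains(&(p.key.1, p.dst)) {
                out.push((p.key.1, p.dst, p.msg));
            }
        }
        out
    }
}
```

Because seq_no is unique, the heap never has to compare two equal keys, which removes any dependence on the heap's internal tie-breaking.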

Acceptance

scripts/cross_test.sh succeeds end-to-end on a clean checkout:

=== ALL OK ===

Each of the six scenarios A–F prints the same hex digest for Rust, Go, and C++. The canonical hashes are pinned in docs/observation.md — if any scenario changes you must update the table in the same commit, with a one-line note on what shifted (timer constant, schedule formula, dump layout).

Optional but valuable: rebuild on a second machine with a different OS and compiler (Linux/gcc vs macOS/clang) and confirm the hashes match. All targets in this study are little-endian; the dump assumes that.

db-20 — Distributed KV Store (Concepts)

This lab is the capstone of the distributed-systems track (db-16..19). It stitches consensus and a state machine into the smallest possible replicated key/value store and exposes the result as a deterministic, byte-identical snapshot across Rust, Go, and C++.

Where it sits

| Lab | Topic | Provides |
|---|---|---|
| db-16 | distributed fundamentals | failure models, CAP, FLP |
| db-17 | Raft | a real consensus implementation |
| db-18 | Paxos | a contrast |
| db-19 | ZAB | another contrast |
| db-20 | distributed KV | integration: log + state machine |

The scope of db-17 is "implement Raft correctly". The scope of db-20 is "given Raft-shaped semantics, build a replicated state machine you can hash byte-for-byte across three languages." So we deliberately do not re-implement leader election, randomised timers, RPCs, or persistent log files. We model just enough of consensus to study the integration boundary.

Simplifications (vs. real Raft / db-17)

| Concept | db-17 | db-20 |
|---|---|---|
| Networking | message-driven | direct in-process broadcast |
| Elections | randomised timeouts | fixed leader, current_term == 1 |
| Followers' acks | RPC reply | function return |
| Log replication | next_index walk on conflict | one-shot snapshot push (truncate_and_replay) |
| Partition | network simulation | Cluster::partition({ids}) drops messages |
| Heal / catchup | next-index probes | full log copy + replay |
| Persistence | log file + fsync | none (in-memory Vec<LogEntry>) |

The simplifications are honest — they collapse implementation details that do not affect the property we care about: a leader's state-machine snapshot is identical to every healthy follower's snapshot, and identical across languages.

Data model

Op            = NoOp | Put(key, value) | Del(key)
LogEntry      = { term: u64, idx: u64, op: Op }
Replica       = { id, log: Vec<LogEntry>, commit_index, current_term,
                  state_machine: BTreeMap<String, Vec<u8>> }
Cluster       = { replicas[5], leader_idx=0, partitions, next_log_idx }

state_machine is the applied projection of the committed log prefix. We do not store tombstones — Del actually removes the entry.

Propose / commit cycle

  1. The leader allocates the next log index and appends LogEntry{term, idx, op} to its own log.
  2. For each follower, in id order:
    • If the follower is partitioned, drop the message.
    • Else call try_append(prev_idx, entry). If the follower's last_idx matches prev_idx, the entry is appended (1 ack). If not, snapshot push: truncate_and_replay(leader.log, leader.commit_index) replaces the follower's log wholesale and re-applies the committed prefix (1 ack).
  3. If acks ≥ quorum (3/5), commit on the leader by advancing commit_index and applying to the state machine. Then in id order, advance every reachable follower's commit_index too.
  4. Otherwise the entry stays in the leader's log uncommitted — a future successful proposal or a heal() will retro-commit it.

The total order of commits is the log order: idx 1, 2, 3, ....

Partition + heal

Cluster::partition({3,4}) adds replica ids 3 and 4 to the partitions set. Subsequent proposals do not message them and do not count their acks. If quorum is still reachable (5 − 2 = 3 ≥ 3), the cluster keeps committing. If not, every proposal returns false and the leader's tail grows uncommitted.

Cluster::heal() clears the set and, for each healed follower in ascending id order, performs truncate_and_replay(leader.log, leader.commit_index). This is db-20's stand-in for Raft's next_index-walk conflict resolution: simpler, deterministic, and good enough for the cross-language exam because the final state is identical.
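
The propose / commit cycle and the partition + heal behaviour fit in a short sketch. It follows the descriptions above under stated assumptions: replica 0 is the fixed leader and is never partitioned, the cluster has at least one replica, and heal() only snapshot-pushes (an uncommitted leader tail is retro-committed by the next successful proposal). Method bodies are illustrative, not the reference implementation.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Op { NoOp, Put(String, Vec<u8>), Del(String) }

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct LogEntry { pub term: u64, pub idx: u64, pub op: Op }

#[derive(Default)]
pub struct Replica {
    pub log: Vec<LogEntry>,
    pub commit_index: u64,
    pub state_machine: BTreeMap<String, Vec<u8>>,
}

impl Replica {
    fn last_idx(&self) -> u64 { self.log.last().map_or(0, |e| e.idx) }

    /// Apply entries up to `target` (idx is 1-based, so log[i].idx == i + 1).
    fn commit_to(&mut self, target: u64) {
        let target = target.min(self.log.len() as u64);
        while self.commit_index < target {
            match &self.log[self.commit_index as usize].op {
                Op::NoOp => {}
                Op::Put(k, v) => { self.state_machine.insert(k.clone(), v.clone()); }
                Op::Del(k) => { self.state_machine.remove(k); }
            }
            self.commit_index += 1;
        }
    }

    /// Snapshot push: adopt the leader's log, rebuild from the committed prefix.
    fn truncate_and_replay(&mut self, leader_log: &[LogEntry], leader_commit: u64) {
        self.log = leader_log.to_vec();
        self.state_machine.clear();
        self.commit_index = 0;
        self.commit_to(leader_commit);
    }
}

pub struct Cluster {
    pub replicas: Vec<Replica>, // replicas[0] is the fixed leader
    partitions: BTreeSet<usize>,
    next_log_idx: u64,
}

impl Cluster {
    pub fn new(n: usize) -> Self {
        let replicas = (0..n).map(|_| Replica::default()).collect();
        Cluster { replicas, partitions: BTreeSet::new(), next_log_idx: 1 }
    }

    pub fn partition(&mut self, ids: &[usize]) {
        self.partitions.extend(ids.iter().copied());
    }

    /// Append on the leader, replicate in id order, commit on quorum.
    pub fn propose(&mut self, op: Op) -> bool {
        let quorum = self.replicas.len() / 2 + 1;
        let (leader, followers) = self.replicas.split_at_mut(1);
        let prev_idx = leader[0].last_idx();
        let entry = LogEntry { term: 1, idx: self.next_log_idx, op };
        self.next_log_idx += 1;
        leader[0].log.push(entry.clone());
        let mut acks = 1; // the leader's own ack
        for (i, f) in followers.iter_mut().enumerate() {
            if self.partitions.contains(&(i + 1)) { continue; } // message dropped
            if f.last_idx() == prev_idx {
                f.log.push(entry.clone());
            } else {
                f.truncate_and_replay(&leader[0].log, leader[0].commit_index);
            }
            acks += 1;
        }
        if acks < quorum { return false; } // entry stays uncommitted on the leader
        leader[0].commit_to(entry.idx);
        for (i, f) in followers.iter_mut().enumerate() {
            if !self.partitions.contains(&(i + 1)) { f.commit_to(entry.idx); }
        }
        true
    }

    /// Clear the partition set; snapshot-push healed followers in ascending id order.
    pub fn heal(&mut self) {
        let healed = std::mem::take(&mut self.partitions);
        let (leader, followers) = self.replicas.split_at_mut(1);
        for id in healed.into_iter().filter(|&id| id >= 1) {
            followers[id - 1].truncate_and_replay(&leader[0].log, leader[0].commit_index);
        }
    }
}
```

Committing with commit_to(entry.idx) rather than commit_index += 1 is what lets a later successful proposal sweep up any earlier entries that failed to reach quorum.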

Cross-language byte identity — the exam

Wire format for one replica's snapshot

magic           "DSEDKV20"           (8 bytes)
u64 LE          commit_index
u64 LE          current_term         (== 1 in this lab)
u32 LE          entry_count
  for each (k, v) in ascending k order:
    u32 LE k_len | k_bytes
    u32 LE v_len | v_bytes
  • Iteration order is ascending key. Rust uses BTreeMap, C++ uses std::map — both naturally ascending. Go's map iteration is randomised, so the Go implementation does an explicit sort.Strings before serialising.
  • Deleted keys do not appear in the dump (Del removes the entry; there are no tombstones).
  • All integers little-endian.
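
The wire format above maps directly onto a small serializer. A minimal sketch, assuming the state machine is the BTreeMap<String, Vec<u8>> from the data model; snapshot_bytes is a hypothetical function name.

```rust
use std::collections::BTreeMap;

/// Serialize one replica's snapshot in the layout described above.
pub fn snapshot_bytes(
    commit_index: u64,
    current_term: u64,
    state_machine: &BTreeMap<String, Vec<u8>>,
) -> Vec<u8> {
    let mut out = b"DSEDKV20".to_vec();
    out.extend_from_slice(&commit_index.to_le_bytes());
    out.extend_from_slice(&current_term.to_le_bytes());
    out.extend_from_slice(&(state_machine.len() as u32).to_le_bytes());
    // BTreeMap iterates in ascending key order (byte-wise for String).
    for (k, v) in state_machine {
        out.extend_from_slice(&(k.len() as u32).to_le_bytes());
        out.extend_from_slice(k.as_bytes());
        out.extend_from_slice(&(v.len() as u32).to_le_bytes());
        out.extend_from_slice(v);
    }
    out
}
```

Ascending order here is byte-wise, so "k10" sorts before "k2"; a port that sorted keys numerically would produce different bytes.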

Workload spec

splitmix64 constants: 0x9E3779B97F4A7C15, 0xBF58476D1CE4E7B5, 0x94D049BB133111EB

setup:
  Cluster::new(5)
  if scenario == "partition":
    at op = ops/4   → partition([3, 4])
    at op = ops*3/4 → heal()

for op in 0..ops:
  r1, r2, r3 = rng.next() ×3
  kind = (r1 >> 62) & 0x3                # 0,1,2 → Put; 3 → Del
  k    = "k" + (r2 % keys).to_string()
  v    = u64_le(r3 % 10000)              # 8 bytes
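
The op-decoding loop can be sketched in Rust as below. Two details are assumptions rather than facts from the spec: the seed is used directly as the initial splitmix64 state, and all three draws are consumed even when the op is a Del. SplitMix64, next_u64 and next_op are hypothetical names, and this sketch is not claimed to reproduce the golden hashes.

```rust
pub struct SplitMix64 { state: u64 }

impl SplitMix64 {
    /// Assumption: the CLI seed is the initial state.
    pub fn new(seed: u64) -> Self { SplitMix64 { state: seed } }

    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let z = (self.state ^ (self.state >> 30)).wrapping_mul(0xBF58_476D_1CE4_E7B5);
        let z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum Op { Put(String, Vec<u8>), Del(String) }

/// Decode one workload op from three consecutive draws. `keys` must be non-zero.
pub fn next_op(rng: &mut SplitMix64, keys: u64) -> Op {
    let (r1, r2, r3) = (rng.next_u64(), rng.next_u64(), rng.next_u64());
    let key = format!("k{}", r2 % keys);
    match (r1 >> 62) & 0x3 {
        3 => Op::Del(key), // top two bits == 3: one draw in four is a delete
        _ => Op::Put(key, (r3 % 10_000).to_le_bytes().to_vec()), // 8-byte LE value
    }
}
```

Drawing r3 unconditionally keeps the generator's position independent of the op kind, so every op advances the stream by exactly three values.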

Frozen golden hashes

| Scenario | Args | sha256 |
|---|---|---|
| A | --seed 42 --ops 500 --keys 32 --scenario default | 1febc1252f87f873c315526e9d9c78a622131d700dccca84a6e089244930252b |
| B | --seed 7 --ops 2000 --keys 128 --scenario partition | 272af5b41b729896a7195a6ea72d19111a96a50b29d5d4cdfaac03a058e1a2dc |

These are baked into scripts/cross_test.sh, src/go/dkv20_test.go, and src/cpp/tests/test_dkv20.cc. Any change to PRNG / wire format / op decoding / partition timing / snapshot push will break them — which is exactly the point.

Where to look next

  • src/rust/src/lib.rs — the reference implementation. Read it first.
  • src/go/dkv20.go — port. Note the explicit sort.Strings before writing wire bytes.
  • src/cpp/src/dkv20.cc — port. Note the manual little-endian writers and pure-stdlib sha256.
  • docs/ — the long-form study notes (analysis, execution, observation, verification, broader ideas).
  • steps/ — the three-step study plan if you are walking the lab fresh.

db-20 — References

Distributed-system foundations and the specific consensus / replication ideas that informed this lab.

Consensus

CAP / consistency models

  • Brewer, E. Towards Robust Distributed Systems (PODC 2000 keynote).
  • Gilbert, S. & Lynch, N. Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. SIGACT 2002.
  • Vogels, W. Eventually Consistent. CACM 2009.

Transactional storage

  • Gray, J. & Reuter, A. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, 1993. Chapter 7 on replicated data.
  • Mohan, C. et al. ARIES. ACM TODS 17(1), 1992 — background on why our log is append-only.

State-machine replication

  • Schneider, F. B. Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial. ACM Comp. Surveys 22(4), 1990.

Production systems for comparison

Self-references in this repo

  • db-16-distributed-fundamentals/ — failure models, CAP/FLP intuitions.
  • db-17-raft/ — the underlying consensus algorithm.
  • db-09-leveldb-complete/ — the storage-engine quality bar this lab matches.

db-20 — Analysis

What is the question?

Given Raft-shaped consensus semantics, can we build a replicated state machine that produces a byte-identical snapshot across three language ecosystems? "Byte-identical" is the strongest possible test of cross-language conformance — strings, integers, map iteration order, and op semantics all have to line up.

Why is this an interesting study?

Raft on its own (db-17) tells you nothing about how a real key/value store is layered on top of it. Production systems (etcd, TiKV, CockroachDB) all answer the same questions:

  1. What does the leader send to followers? (log entries)
  2. When does a follower apply an entry? (when its commit_index advances)
  3. How does a partitioned follower catch up? (next-index probe / install snapshot)
  4. What invariants does the state machine maintain across replicas?

db-20 strips out the network and timer noise so we can focus on questions 2–4 alone. The simplification turns out to be the whole point: once you stop worrying about elections, the integration story fits in ~400 lines of Rust.

Design choices and trade-offs

Snapshot push instead of next-index walk

Raft's real conflict resolution is "decrement next_index, retry". For our purposes that produces the same final state as a one-shot snapshot push, but it forces us to model RPC round-trips. We pick the snapshot push because:

  • it converges in a single step (deterministic), and
  • it makes heal() trivial to write — just truncate and replay.

The cost: we cannot study log-divergence scenarios where two leaders both append. That's fine: this lab is single-leader by construction.

State machine is BTreeMap<String, Vec<u8>>

A sorted map gives free deterministic iteration in Rust and C++. Go's map has randomised iteration, so the Go implementation explicitly sorts before serialising. This is the single most common source of non-determinism in cross-language ports — every wire-format-aware function in the Go code does sort.Slice or sort.Strings.
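A minimal Rust sketch of the sorting discipline described above. The `canonical_bytes` helper and its length-prefixed layout are assumptions made for illustration; they are not the lab's snapshot format (that lives in CONCEPTS.md). The point is only that sorting keys before serialising makes the output independent of the container's iteration order, which is what the Go port has to do by hand.

```rust
use std::collections::{BTreeMap, HashMap};

/// Serialise key/value pairs in ascending key order.
/// ASSUMPTION: the layout (u32 LE length prefix, then raw bytes) is
/// illustrative only, not the db-20 wire format.
fn canonical_bytes<'a, I>(pairs: I) -> Vec<u8>
where
    I: IntoIterator<Item = (&'a String, &'a Vec<u8>)>,
{
    // Collect and sort so the output never depends on iteration order.
    let mut sorted: Vec<(&String, &Vec<u8>)> = pairs.into_iter().collect();
    sorted.sort_by(|a, b| a.0.cmp(b.0));

    let mut out = Vec::new();
    for (k, v) in sorted {
        out.extend_from_slice(&(k.len() as u32).to_le_bytes());
        out.extend_from_slice(k.as_bytes());
        out.extend_from_slice(&(v.len() as u32).to_le_bytes());
        out.extend_from_slice(v);
    }
    out
}

fn main() {
    let mut sorted_map = BTreeMap::new();
    let mut hashed_map = HashMap::new();
    for (k, v) in vec![("b", vec![2u8]), ("a", vec![1u8]), ("c", vec![3u8])] {
        sorted_map.insert(k.to_string(), v.clone());
        hashed_map.insert(k.to_string(), v);
    }
    // Same bytes regardless of which container held the data.
    assert_eq!(canonical_bytes(&sorted_map), canonical_bytes(&hashed_map));
    println!("{} canonical bytes", canonical_bytes(&sorted_map).len());
}
```

With a BTreeMap the explicit sort is redundant, but keeping it in the serialiser means the function stays correct if the container is ever swapped for a hash map.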

Op encoding inside LogEntry is not wire-stable

The log is in-memory only; we never serialise LogEntry itself. Cross-language byte identity is only required at the snapshot boundary. This separation of "internal" and "wire" structures is cheap discipline that scales to real systems.

current_term is in the snapshot but is always 1

We expose current_term in the wire format anyway, plumbed through to all three implementations. This makes it cheap to add elections later (e.g. as an extension exercise) without having to bump the magic.

Failure-mode catalogue (what we covered, what we did not)

| Failure | db-17 covers? | db-20 covers? |
|---|---|---|
| Single follower crash + catchup | yes | yes (heal) |
| Network partition isolating minority | yes | yes |
| Leader crash + new election | yes | no (fixed leader) |
| Split-brain after partition heal | yes | no (no elections) |
| Log compaction / snapshot install | scratched the surface | no |
| Disk-loss / log truncation | no | no |
| Byzantine behaviour | no | no |

Where to take this next

  • broader-ideas.md lists the explicit extensions: linearizable reads, log compaction, multi-region replicas, learner replicas, snapshot install over the wire, gossip-style cluster membership.
  • The exam in cross_test.sh doubles as a regression net for any of those extensions — break the snapshot bytes, you break the build.

db-20 — Execution Plan

Stage 1 — Single replica, no replication

Implement Replica and apply() for Op::{NoOp, Put, Del} in Rust. Verify that a Cluster::new(1) (1-replica cluster — trivially has its own quorum) can propose(Put("a", b"v")) and the state machine sees a → b"v". Test cases 1 and 3 in the Rust suite.

Stage 2 — Five replicas, no failures

Add Cluster, propose, quorum = N/2 + 1. Verify that a single propose on a 5-replica cluster applies to all five state machines because all 5 follow the leader. Tests 2 and 6.
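The quorum arithmetic is small enough to pin down in a few lines. This is a sketch; `tolerated_failures` is a hypothetical helper added here to make the consequence of the formula visible, not a function named by the lab.

```rust
/// Majority quorum for a cluster of `n` replicas: floor(n / 2) + 1.
fn quorum(n: usize) -> usize {
    n / 2 + 1
}

/// Hypothetical helper: how many replicas may be unreachable while
/// proposals can still reach a quorum.
fn tolerated_failures(n: usize) -> usize {
    n - quorum(n)
}

fn main() {
    for n in vec![1usize, 3, 4, 5] {
        println!(
            "n={} quorum={} tolerates={}",
            n,
            quorum(n),
            tolerated_failures(n)
        );
    }
}
```

Note that going from 3 to 4 replicas raises the quorum without raising the number of tolerated failures, which is why odd cluster sizes are the usual choice.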

Stage 3 — Partitions

Add Cluster::partition(ids) and is_partitioned. Drop messages to and from partitioned replicas. Test that:

  • 3/5 reachable still commits (test 4),
  • 2/5 reachable does not commit (test 3),
  • partitioned followers have commit_index == 0 after one proposal (test 4).

Stage 4 — Heal + catchup

Implement Cluster::heal and Replica::truncate_and_replay. Verify that after a sequence of mutations on healthy replicas, calling heal() brings the partitioned ones back to byte-identical snapshots. Tests 5 and 13.

Stage 5 — Canonical snapshot

Decide the wire format (see CONCEPTS.md), implement dump_state, write the byte-format test that pins every field offset (test 8). The test fails loudly if a future refactor changes endianness or field order.

Stage 6 — Workload driver

Port splitmix64 (mix and stateful generator). Decode each r1 high-bit pair into Put/Del. Encode r3 % 10000 as a fixed 8-byte LE value so the byte width is independent of host word size. Tests 9, 10.
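A sketch of the pieces this stage names, assuming the standard SplitMix64 constants (the same ones quoted in the db-21 execution notes). The type and function names are hypothetical. The mapping from the two kind bits to Put or Del is not spelled out in this section, so the sketch stops at extracting the bits.

```rust
/// SplitMix64 stateful generator: add the gamma constant, then mix.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        SplitMix64 { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        // All arithmetic is unsigned 64-bit and wraps on overflow.
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// The top two bits of r1 select the op kind (0..=3).
fn op_kind_bits(r1: u64) -> u8 {
    ((r1 >> 62) & 0x3) as u8
}

/// Value bytes: r3 % 10000 as a fixed 8-byte little-endian integer,
/// so the width does not depend on the host word size.
fn value_bytes(r3: u64) -> [u8; 8] {
    (r3 % 10_000).to_le_bytes()
}

fn main() {
    let mut rng = SplitMix64::new(42);
    // Draw all three values per op, in order, even if one goes unused.
    let (r1, _r2, r3) = (rng.next_u64(), rng.next_u64(), rng.next_u64());
    println!("kind bits = {}, value = {:?}", op_kind_bits(r1), value_bytes(r3));
}
```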

Stage 7 — Cross-language exam

Build the Rust binary, capture the actual hash for scenarios A and B, bake those hashes into src/go/dkv20_test.go, src/cpp/tests/test_dkv20.cc, and scripts/cross_test.sh. Port Go. Port C++. Run bash scripts/cross_test.sh and watch all three values align.

Stage 8 — Verification + docs

scripts/verify.sh runs all three test suites. scripts/cross_test.sh runs all three binaries on both scenarios. The five docs (analysis, execution, observation, verification, broader-ideas) plus the three steps/ study files complete the lab.

Pitfalls to expect

| Symptom | Likely cause |
|---|---|
| Go scenario hash doesn't match Rust | unsorted map iteration in DumpState |
| C++ scenario hash doesn't match Rust | endian / size mismatch in put_u32_le / put_u64_le |
| C++ tests pass in Debug, fail in Release | assert(side_effect): Release strips it |
| Wrong commit_index after partition heal | snapshot push not clearing state_machine |
| Build error: duplicate package declaration | create_file leftover from a stub |
| Subagent left half-built ports | resume manually, hash will tell you if it works |

db-20 — Observation

Frozen exam hashes

| Scenario | Args | sha256 |
|---|---|---|
| A | --seed 42 --ops 500 --keys 32 --scenario default | 1febc1252f87f873c315526e9d9c78a622131d700dccca84a6e089244930252b |
| B | --seed 7 --ops 2000 --keys 128 --scenario partition | 272af5b41b729896a7195a6ea72d19111a96a50b29d5d4cdfaac03a058e1a2dc |

These three statements are all asserted by scripts/cross_test.sh:

  1. Rust, Go, and C++ each produce the hash above for the given scenario.
  2. All five replicas in the cluster produce the identical snapshot after the scenario completes (TestScenarioBReplicasConverge in Go, test_scenario_b_frozen in C++, scenario_b_partitioned_replicas_converge_after_heal in Rust).
  3. The cluster has no live partitions when the driver returns.

Quantitative observations

| metric | scenario A | scenario B |
|---|---|---|
| ops | 500 | 2000 |
| keys parameter | 32 | 128 |
| committed-on-leader entries | 500 | 2000 |
| approximate Put / Del fraction | 3/4 vs 1/4 | 3/4 vs 1/4 |
| live keys in final state (approx) | < 32 | < 128 |

Every committed entry executes exactly once on the leader; partitioned followers see all of them after heal() because truncate_and_replay replays the entire log.

Behavioural observations

  • Convergence is deterministic. No timeouts, no clocks. Running the workload twice with the same seed always yields the same bytes (workload_determinism in Rust, TestWorkloadDeterminism in Go, test_workload_deterministic in C++).
  • Sub-quorum proposals leave uncommitted tails. The leader's last_idx advances every propose, but commit_index only advances on quorum acks. This is observable in the test sub_quorum_does_not_commit — the leader sees last_idx == 1 and commit_index == 0.
  • Heal is a one-shot. In scenario B, after the heal call at ops*3/4, all five replicas have byte-identical state machines. There is no period of eventual consistency — convergence is instantaneous and deterministic by construction.
  • Delete is real. Del removes the key from the state machine. A later Put reusing the same key is a fresh entry, not a "revive". This is asserted by test_del_removes_key (C++) and friends.

Performance notes (this lab is not a perf study)

The reference implementations are single-threaded, in-memory, with no I/O. Scenario B runs in ~5 ms in Release Rust on an M-series Mac; the snapshot push during heal() is linear in the length of the leader's log for each partitioned follower, which is the dominant cost.

The lab deliberately optimises for clarity and byte-identity, not throughput. Real systems (db-09 leveldb-complete is a good adjacent reference) batch and pipeline replication; here every propose is synchronous.

db-20 — Verification

Three layers of test

1. Per-language unit tests

| File | Tests | Covers |
|---|---|---|
| src/rust/src/lib.rs mod tests | 14 | splitmix64, sha256, single replica, quorum, sub-quorum, partition, heal, convergence, del, byte format, determinism, scenarios A/B, snapshot push, NoOp |
| src/go/dkv20_test.go | 15 | same set + an extra stdlib sha256 cross-check |
| src/cpp/tests/test_dkv20.cc | 13 | same set |

Run with:

( cd src/rust && cargo test --release )
( cd src/go && go test ./... )
( cd src/cpp/build && cmake --build . && ctest --output-on-failure )

scripts/verify.sh is the one-shot wrapper for all three and ends with === OK ===.

2. Cross-language byte-identity exam

scripts/cross_test.sh builds three clusterctl binaries (Rust, Go, C++) and runs:

clusterctl workload --seed 42 --ops 500  --keys 32  --scenario default
clusterctl workload --seed 7  --ops 2000 --keys 128 --scenario partition

The script asserts: rust_hash == go_hash == cpp_hash == golden_hash for each scenario. Ends with === ALL OK ===. Failure on any line exits non-zero and prints the diverging hashes.

3. Frozen golden hashes baked into source

The golden values are duplicated in three places on purpose:

  • scripts/cross_test.sh
  • src/go/dkv20_test.go (hashA, hashB constants)
  • src/cpp/tests/test_dkv20.cc (string-literal in test_scenario_a_frozen and test_scenario_b_frozen)

A change to the wire format or workload spec must update all three to keep verify + cross_test green. The redundancy guards against silent drift: updating only one copy leaves the other two failing.

Sanity-check invariants (asserted by the tests)

  • Sha256Hex(empty) and Sha256Hex("abc") match the canonical SHA-256 test vectors.
  • Splitmix64Mix(0) == 0x8b57dafca0cee644 in all three languages.
  • DumpState for one Put produces exactly 38 bytes whose layout is pinned byte-by-byte.
  • NoOp advances commit_index but leaves the state machine empty.

What this exam does NOT verify

  • Real persistence (no log file, no fsync).
  • Real elections (leader is fixed; current_term == 1).
  • Real RPC failure injection (we model partitions only).
  • Linearizable read paths (reads are direct map lookups).

Those are deliberate scope cuts — see analysis.md and broader-ideas.md.

db-20 — Broader Ideas

The lab is deliberately small. Here is the menu of extensions that keep the same cross-language exam structure intact.

Elections

Add a term bump path. Replace fixed leader_idx = 0 with a leader elected by randomised timeout (or a deterministic priority list, if you want to keep cross-language byte identity). The snapshot already serialises current_term, so the wire format does not need to change. New invariant to test: after a leader change, every healthy replica still converges to the same snapshot.

Linearizable reads

Currently reads are direct state_machine.get(k). To make them linearizable, gate every read through a "read index" — leader confirms it is still leader by exchanging heartbeats with a quorum, then returns the value at commit_index. The byte-identity exam stays the same; you add a TestReadIndexBlocksUntilQuorum-style scenario.

Log compaction / snapshot install

Today heal() ships the entire log. For long-running clusters that is unbounded. Add Replica::compact(up_to_idx) that drops the prefix and records a CompactedSnapshot at the head; change try_append to also accept "follower has snapshot_idx == prev_idx - delta". The exam scenarios still pass because the applied state is unchanged.

Multi-key transactions

Replace Op::Put with Op::Txn(Vec<KeyOp>) and apply atomically. This is a small, well-scoped extension that nudges the lab toward db-13 (transactions and MVCC) territory.

Membership changes

Add a JointConsensus op (Raft §6) that switches the cluster's quorum during a configuration change. Trickier — the snapshot needs to include the active config — but a worthy follow-on if you want to see why "just add a node" is a real problem.

Disk persistence

Persist the log to a file (use db-01's pwrite-and-fsync pattern). Test crash recovery by tearing down a replica and reconstructing it from the log file. The snapshot bytes do not change.

Learner replicas

Add a replica role that receives entries but does not count toward quorum. Useful for read scaling. The snapshot bytes do not change.

Gossip-style membership

Replace the static replica list with a SWIM-style gossip layer that discovers and evicts replicas. Far more invasive — at this point you are building etcd.

Bridges to other labs in the repo

| Extension | Builds on which other lab? |
|---|---|
| Disk persistence | db-01 (storage primitives), db-03 (WAL) |
| Linearizable reads | db-16 (distributed fundamentals) |
| Multi-key transactions | db-13 (transactions and MVCC) |
| Compaction / snapshot | db-09 (leveldb), db-21 (storage advanced) |
| Real elections + RPCs | db-17 (raft) |
| Multi-region / quorum mix | db-22 (perf & benchmarking) |

Step 01 — Cluster and Replica

Goal: in ~30 minutes, build a single-replica Cluster::new(1) that accepts Put / Del and returns a state machine you can inspect.

What to read first

  • CONCEPTS.md § "Data model" and § "Propose / commit cycle".
  • db-17-raft/CONCEPTS.md for the words log entry, commit index, quorum if they are not yet second nature.

Concrete tasks

  1. Define OpKind, Op, LogEntry, Replica, Cluster in the language of your choice. Match the field layout from src/rust/src/lib.rs.
  2. Implement Replica::last_idx, Replica::try_append, Replica::advance_commit_to. Note that apply(state_machine, op) is the only place where state_machine mutates outside of truncate_and_replay.
  3. Implement Cluster::new(n), quorum, propose. For now, treat every reachable follower as a successful append (no NACK path yet).
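One possible shape for tasks 1 and 2, as a sketch only. The field names and the 1-based index convention are assumptions; the layout to match is the one in src/rust/src/lib.rs, which this sketch has not seen.

```rust
use std::collections::BTreeMap;

/// Operations the state machine understands (names from the lab text).
#[derive(Clone, Debug, PartialEq)]
enum Op {
    NoOp,
    Put(String, Vec<u8>),
    Del(String),
}

/// One slot in the replicated log. ASSUMPTION: field names are illustrative.
#[derive(Clone, Debug, PartialEq)]
struct LogEntry {
    term: u64,
    idx: u64, // 1-based position in the log
    op: Op,
}

#[derive(Default)]
struct Replica {
    log: Vec<LogEntry>,
    commit_index: u64,
    state_machine: BTreeMap<String, Vec<u8>>,
}

/// The only place the state machine mutates during normal operation.
fn apply(sm: &mut BTreeMap<String, Vec<u8>>, op: &Op) {
    match op {
        Op::NoOp => {}
        Op::Put(k, v) => {
            sm.insert(k.clone(), v.clone());
        }
        Op::Del(k) => {
            sm.remove(k);
        }
    }
}

impl Replica {
    /// Index of the last log entry, or 0 for an empty log.
    fn last_idx(&self) -> u64 {
        self.log.last().map_or(0, |e| e.idx)
    }

    /// Accept `entry` only if it directly extends the log; a gap returns false.
    fn try_append(&mut self, entry: LogEntry) -> bool {
        if entry.idx != self.last_idx() + 1 {
            return false;
        }
        self.log.push(entry);
        true
    }

    /// Apply every entry in (commit_index, target] exactly once.
    fn advance_commit_to(&mut self, target: u64) {
        let target = target.min(self.last_idx());
        while self.commit_index < target {
            let i = self.commit_index as usize; // entry with idx == i + 1
            apply(&mut self.state_machine, &self.log[i].op);
            self.commit_index += 1;
        }
    }
}

fn main() {
    let mut r = Replica::default();
    let put = Op::Put("a".into(), vec![1, 2, 3]);
    assert!(r.try_append(LogEntry { term: 1, idx: 1, op: put }));
    assert!(r.try_append(LogEntry { term: 1, idx: 2, op: Op::NoOp }));
    r.advance_commit_to(2);
    println!("a = {:?}", r.state_machine.get("a"));
}
```

Keeping `apply` as a free function over the map makes it easy to reuse from a replay path later without going through `advance_commit_to`.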

Definition of done

let mut c = Cluster::new(1);
// A 1-replica cluster is its own quorum, so the proposal commits at once.
assert!(c.propose(Op::Put("a".into(), vec![1, 2, 3])));
assert_eq!(c.leader().state_machine.get("a"), Some(&vec![1, 2, 3]));

The assertions above pass in Rust, and their equivalents pass in Go and C++. Run cargo test single_replica_put_commits to confirm.

Common bugs at this stage

  • Forgetting to bump next_log_idx so two proposals get the same idx.
  • Applying op before the entry is committed.
  • Iterating an unsorted map somewhere (Go) — even at this stage, start the habit of sort.Strings(keys) before any deterministic output.

Step 02 — Quorum Replication

Goal: extend the cluster to 5 replicas and make the leader commit only when a majority acks.

What to read first

  • docs/execution.md Stages 2 and 3.
  • The propose loop in src/rust/src/lib.rs (the part that calls try_append and counts acks).

Concrete tasks

  1. Implement Cluster::partition(ids) and is_partitioned. Store partitioned ids in a sorted set so iteration order is stable.
  2. In propose, count acks only from non-partitioned, non-leader replicas (plus one for the leader if the leader itself is not partitioned). If acks < quorum, return false and leave the entry uncommitted.
  3. Write the three quorum tests:
    • 5/5 reachable → commit on all.
    • 3/5 reachable → commit on the reachable three.
    • 2/5 reachable → no commit anywhere.
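The ack-counting rule from task 2 can be isolated from the rest of propose. A minimal sketch, assuming replica ids are 0..n and every reachable follower acks (no NACK path yet); `count_acks` and `commits` are hypothetical names.

```rust
use std::collections::BTreeSet;

fn quorum(n: usize) -> usize {
    n / 2 + 1
}

/// Count acks for one proposal in an `n`-replica cluster.
fn count_acks(n: usize, leader: usize, partitioned: &BTreeSet<usize>) -> usize {
    // The leader counts itself exactly once, and only if it is reachable.
    let mut acks = if partitioned.contains(&leader) { 0 } else { 1 };
    for id in 0..n {
        if id == leader || partitioned.contains(&id) {
            continue; // never double-count the leader or count a partitioned follower
        }
        acks += 1;
    }
    acks
}

/// A proposal commits only when acks reach the majority quorum.
fn commits(n: usize, leader: usize, partitioned: &BTreeSet<usize>) -> bool {
    count_acks(n, leader, partitioned) >= quorum(n)
}

fn main() {
    let partitioned: BTreeSet<usize> = vec![2, 3, 4].into_iter().collect();
    println!(
        "acks={} quorum={} commits={}",
        count_acks(5, 0, &partitioned),
        quorum(5),
        commits(5, 0, &partitioned)
    );
}
```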

Definition of done

let mut c = Cluster::new(5);
c.partition(&[2, 3, 4]); // only replicas 0 and 1 stay reachable
assert!(!c.propose(Op::Put("k".into(), b"v".to_vec()))); // 2 acks < quorum of 3
assert_eq!(c.leader().last_idx(), 1);    // the entry was appended on the leader
assert_eq!(c.leader().commit_index, 0);  // but it was never committed

The snippet above passes in Rust. The Go and C++ ports must match.

Common bugs at this stage

  • Counting partitioned followers' acks anyway (a follower in the partitions set must contribute zero acks).
  • Counting the leader twice (once for i == leader_idx, once for acks = 1).
  • Advancing commit_index on the leader but not on the followers.
  • Mutating state_machine before the commit check passes.

Step 03 — Partitions and Catchup

Goal: implement heal() and truncate_and_replay so a partitioned follower can rejoin and converge. Then ship the cross-language exam.

What to read first

  • CONCEPTS.md § "Partition + heal".
  • docs/execution.md Stages 4–7.
  • src/rust/src/lib.rs — the truncate_and_replay and Cluster::heal bodies.

Concrete tasks

  1. Implement Replica::truncate_and_replay(leader_log, leader_commit): replace own log, wipe state machine, replay committed prefix.
  2. Implement Cluster::heal():
    • clone leader log + commit index up front (avoid use-after-mutate),
    • clear partitions,
    • for each previously-partitioned follower in ascending id order, call truncate_and_replay.
  3. In propose, when try_append returns false (gap), do a snapshot push immediately and count the ack.
  4. Implement dump_state per the wire format in CONCEPTS.md. Pin every byte offset in a test (test_snapshot_byte_format).
  5. Port the workload driver (run_workload) to all three languages. The byte-decoding rules — kind = (r1 >> 62) & 0x3, k = "k" + …, v = u64_le(r3 % 10000) — must be identical across all three.
  6. Build Rust binary, run scenarios A and B, capture the hashes, bake them into Go test, C++ test, and scripts/cross_test.sh.
  7. Bring Go and C++ green: run scripts/cross_test.sh. It must end with === ALL OK ===.
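Tasks 1 and 2 can be sketched as follows. The struct layouts here are simplified assumptions (the log stores bare ops and commit_index counts applied entries); the real bodies are in src/rust/src/lib.rs.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Clone, Debug, PartialEq)]
enum Op {
    Put(String, Vec<u8>),
    Del(String),
}

#[derive(Clone, Debug, Default, PartialEq)]
struct Replica {
    log: Vec<Op>,        // entry at position i has 1-based index i + 1
    commit_index: usize, // number of entries applied so far
    state_machine: BTreeMap<String, Vec<u8>>,
}

impl Replica {
    /// Snapshot push: adopt the leader's log, wipe state, replay the committed prefix.
    fn truncate_and_replay(&mut self, leader_log: &[Op], leader_commit: usize) {
        self.log = leader_log.to_vec();
        self.state_machine.clear();
        for op in &self.log[..leader_commit] {
            match op {
                Op::Put(k, v) => {
                    self.state_machine.insert(k.clone(), v.clone());
                }
                Op::Del(k) => {
                    self.state_machine.remove(k);
                }
            }
        }
        self.commit_index = leader_commit;
    }
}

struct Cluster {
    replicas: Vec<Replica>,
    leader_idx: usize,
    partitions: BTreeSet<usize>,
}

impl Cluster {
    fn heal(&mut self) {
        // Clone leader state up front so the loop never reads it mid-mutation.
        let leader_log = self.replicas[self.leader_idx].log.clone();
        let leader_commit = self.replicas[self.leader_idx].commit_index;
        // Taking the set clears the partitions; BTreeSet iterates in ascending id order.
        let healed = std::mem::take(&mut self.partitions);
        for id in healed {
            if id != self.leader_idx {
                self.replicas[id].truncate_and_replay(&leader_log, leader_commit);
            }
        }
    }
}

fn main() {
    let log = vec![
        Op::Put("a".into(), vec![1]),
        Op::Put("b".into(), vec![2]),
        Op::Del("a".into()),
    ];
    let mut leader = Replica::default();
    leader.truncate_and_replay(&log, 3); // reuse the replay path to build leader state
    let mut c = Cluster {
        replicas: vec![leader, Replica::default(), Replica::default()],
        leader_idx: 0,
        partitions: vec![2usize].into_iter().collect(),
    };
    c.heal();
    assert_eq!(c.replicas[2], c.replicas[0]);
    println!("replica 2 caught up: {:?}", c.replicas[2].state_machine);
}
```

Wiping the state machine before the replay matters: a follower that applied entries before being partitioned would otherwise keep keys that the leader's committed prefix later deleted.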

Definition of done

bash scripts/verify.sh        # → "=== OK ==="
bash scripts/cross_test.sh    # → "=== ALL OK ==="

Both scenarios produce:

  • A: 1febc1252f87f873c315526e9d9c78a622131d700dccca84a6e089244930252b
  • B: 272af5b41b729896a7195a6ea72d19111a96a50b29d5d4cdfaac03a058e1a2dc

Common bugs at this stage

  • heal() reads leader.log after mutating a follower — use a snapshot variable.
  • dump_state in Go iterates the map directly → randomised hash. Fix: sort the keys.
  • dump_state in C++ uses strcpy(magic_buf, "DSEDKV20") and copies 9 bytes including the NUL. Fix: std::memcpy(buf, MAGIC.data(), 8).
  • C++ test passes in Debug, fails in Release because assert(c.propose(...)) got stripped. Fix: assign to a bool ok = ... first, then assert(ok).
  • CLI prints a trailing newline. The exam compares full lines; a trailing \n breaks the hash comparison.

Advanced Storage Engine

Lab status: complete. All unit tests pass; scripts/cross_test.sh proves three independent implementations (Rust, Go, C++) produce byte-identical canonical wire dumps for three fixed workloads.

1. What Is It

A standalone study of two pieces that turn a textbook LSM tree into something closer to RocksDB / LevelDB strength:

  1. Range tombstones: a single record that logically deletes every key in a half-open interval [start, end), instead of writing one Delete per key.
  2. Compaction policies: size-tiered (Cassandra/Scylla heritage) and Universal (RocksDB's flagship), expressed as deterministic, side-effect-free functions over the sequence of SSTs.

Everything else (block cache, bloom layout, manifest, WAL, MVCC) is held at its simplest possible form so the two ideas above are studied in isolation.

2. Why Care

  • Range tombstones make DELETE FROM t WHERE id BETWEEN ? AND ? and TTL-style "drop everything older than X" affordable. Without them, a one-million-key range delete writes one million Delete entries, and every later read over that range has to step past those tombstones until compaction drops them at the bottom level.
  • Compaction policies are the single biggest knob in any LSM. Size-tiered minimises write amplification at the cost of read amplification; Universal bounds the SST count while preserving recency. Choosing one is choosing the workload shape you'll be good at.

3. Core Data Structures

| Type | Purpose |
|---|---|
| Entry { Put(k,v) \| Delete(k) } | The point-write unit. |
| RangeTomb { start, end } | Half-open interval delete; start ≤ k < end. |
| SimpleBloom (u64) | 64-bit single-word bloom; two FNV-1a positions. |
| Sst { smallest, largest, entries, range_tombstones, bloom } | An immutable run. |
| LsmTreeAdvanced { ssts (newest first), ratio } | The whole tree. |

Sst::size() = entries.len() + 2 * range_tombstones.len(). The ×2 weight on tombstones is deliberate: it makes compaction more eager when tombstones pile up, which matches real-world tuning advice.
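A sketch of the one-word bloom and the size weighting. The FNV-1a offset and prime are the standard 64-bit parameters. How the lab derives its two bit positions from the hash is not stated in this section, so the `positions` function below is an assumption, and the bitmaps it produces will not match the lab's fixtures unless the real derivation happens to be the same.

```rust
const FNV_OFFSET: u64 = 0xCBF2_9CE4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01B3;

/// Standard 64-bit FNV-1a: xor the byte in, then multiply by the prime.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut h = FNV_OFFSET;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(FNV_PRIME);
    }
    h
}

#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct SimpleBloom(u64);

impl SimpleBloom {
    /// ASSUMPTION: positions are the low 6 bits of the hash and the
    /// 6 bits starting at bit 32. The lab's derivation may differ.
    fn positions(key: &[u8]) -> (u32, u32) {
        let h = fnv1a64(key);
        ((h & 63) as u32, ((h >> 32) & 63) as u32)
    }

    fn insert(&mut self, key: &[u8]) {
        let (a, b) = Self::positions(key);
        self.0 |= (1u64 << a) | (1u64 << b);
    }

    /// False means definitely absent; true means possibly present.
    fn may_contain(&self, key: &[u8]) -> bool {
        let (a, b) = Self::positions(key);
        let mask = (1u64 << a) | (1u64 << b);
        (self.0 & mask) == mask
    }
}

/// Size used by the compaction policies: range tombstones weigh double.
fn sst_size(entries: usize, range_tombstones: usize) -> usize {
    entries + 2 * range_tombstones
}

fn main() {
    let mut bloom = SimpleBloom::default();
    bloom.insert(b"k1");
    println!("bitmap = {:#018x}, k1 possible: {}", bloom.0, bloom.may_contain(b"k1"));
    println!("size(10 entries, 3 tombs) = {}", sst_size(10, 3));
}
```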

4. The Six Algorithms in One Page

  1. Build SST. Walk pending entries right-to-left, mark each key's last occurrence as keep. Then walk left-to-right emitting kept entries (preserves insertion order of survivors). Compute smallest/largest and the bloom in the same left-to-right pass. Range tombstones are copied verbatim.

  2. Get(key). Walk SSTs newest → oldest, accumulating active tombstones into a vector as we go. For each SST:

    • append its tombs to active,
    • if any tomb in active covers key, return None (early exit),
    • if bloom misses, continue (bounds and entries skipped, but older SSTs may still contain covering tombs — so the walk continues),
    • if key < smallest or key > largest, continue for the same reason,
    • linear scan entries; first hit returns Some(v) for Put, None for Delete.
  3. Size-tiered compaction. Pick the longest prefix L ∈ [2, n-1] such that Σ size(ssts[0..L]) ≤ ratio · size(ssts[L]). Merge that prefix into one SST and insert it at the newest position (index 0). If no such L exists, return false.

  4. Universal compaction. Pick the longest contiguous run [i, i+L) with L ≥ 3 such that Σ size(run) ≤ ratio · size(ssts[i+L]) (i.e. the run is followed by an SST at least 1/ratio times the run's total size; for example, twice as big when ratio = 0.5). Ties broken by smaller i. Merge the run, replace it in place.

  5. Merge. Walk the run newest → oldest. For each SST:

    • copy its range tombs into out_tombs (preserved verbatim),
    • for each entry: skip if seen[key] (newer-wins), skip if covered by any previously active tomb, otherwise keep it; mark seen,
    • append the SST's range tombs to active.

    Finally sort out_entries by key for determinism and recompute the bloom + bounds.

  6. Dump (canonical wire format). A length-prefixed binary blob, little-endian throughout. Magic "DSEADV21" ‖ f64 ratio ‖ u32 sst_count ‖ per SST: lenpref(smallest) ‖ lenpref(largest) ‖ u32 nE ‖ entries (u8 kind ‖ lenpref(key) ‖ if Put: lenpref(val)) ‖ u32 nT ‖ tombs (lenpref(start) ‖ lenpref(end)) ‖ u64 bloom.
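Algorithm 1 (Build SST) is the easiest of the six to get subtly wrong, so here is a sketch of its two passes. `dedupe_pending` is a hypothetical name, and the bloom computation and range tombstone copy are left out to keep the example short.

```rust
use std::collections::HashSet;

#[derive(Clone, Debug, PartialEq)]
enum Entry {
    Put(Vec<u8>, Vec<u8>),
    Delete(Vec<u8>),
}

impl Entry {
    fn key(&self) -> &[u8] {
        match self {
            Entry::Put(k, _) | Entry::Delete(k) => k.as_slice(),
        }
    }
}

/// Keep only the last write per key, preserving the relative order of survivors.
/// Returns the kept entries plus the (smallest, largest) key bounds.
fn dedupe_pending(pending: &[Entry]) -> (Vec<Entry>, Option<(Vec<u8>, Vec<u8>)>) {
    // Pass 1, right to left: the first sighting of a key is its last occurrence.
    let mut seen: HashSet<&[u8]> = HashSet::new();
    let mut keep = vec![false; pending.len()];
    for (i, e) in pending.iter().enumerate().rev() {
        if seen.insert(e.key()) {
            keep[i] = true;
        }
    }
    // Pass 2, left to right: emit survivors and track bounds in the same pass.
    let mut out = Vec::new();
    let mut bounds: Option<(Vec<u8>, Vec<u8>)> = None;
    for (i, e) in pending.iter().enumerate() {
        if !keep[i] {
            continue;
        }
        let k = e.key().to_vec();
        bounds = Some(match bounds {
            None => (k.clone(), k),
            Some((lo, hi)) => (lo.min(k.clone()), hi.max(k)),
        });
        out.push(e.clone());
    }
    (out, bounds)
}

fn main() {
    let pending = vec![
        Entry::Put(b"b".to_vec(), b"1".to_vec()),
        Entry::Put(b"a".to_vec(), b"2".to_vec()),
        Entry::Delete(b"b".to_vec()), // supersedes the first Put of "b"
    ];
    let (kept, bounds) = dedupe_pending(&pending);
    println!("kept {} entries, bounds {:?}", kept.len(), bounds);
}
```

The hash set is only probed, never iterated, so its ordering cannot leak into the output.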

5. What's Deliberately Not Here

  • No WAL — recovery is out of scope; the tree is in-memory.
  • No block cache, no separate index/filter blocks — the bloom is one u64.
  • No level structure — SSTs are a flat list, newest first.
  • No snapshots / MVCC — Get is point-in-time only.
  • No concurrency — everything is single-threaded; SSTs are immutable so reads-with-merges would be trivially safe.

These omissions are why the lab fits in three files per language while still exercising the two ideas (range tombstones, compaction policy) at the depth where their subtleties bite.

6. Pointers to Cross-Language Equivalence

The whole point of the lab is that three independent implementations agree byte-for-byte, not just at API level. The shared deterministic workload (SplitMix64 PRNG, three draws per op, fixed flush/compact cadence) and the canonical wire format (Section 4.6) are the two halves of that contract. scripts/cross_test.sh enforces it with three hard-coded sha256 fixtures.

See docs/execution.md for the format spec, docs/verification.md for the expected output of the verification scripts, and docs/analysis.md for the design forces behind both range tombstones and the two compaction policies.

References — db-21

The lab is intentionally self-contained, but the ideas are not original.

Range Tombstones

  • RocksDB blog, "DeleteRange: A New Native RocksDB Operation" https://rocksdb.org/blog/2018/11/21/delete-range.html
  • CockroachDB blog: "DeleteRange and the importance of tombstones in a distributed SQL database."
  • "Bringing Modern Hierarchical Memory Systems Into Main-Memory Databases" (Bortnikov et al.) — discusses interval-deletion structures.

Compaction Policies

  • "The Log-Structured Merge-Tree (LSM-Tree)", O'Neil et al., 1996. The canonical paper; introduces the size-tiered idea.
  • RocksDB wiki, "Universal Compaction Style" https://github.com/facebook/rocksdb/wiki/Universal-Compaction
  • RocksDB wiki, "Leveled Compaction" https://github.com/facebook/rocksdb/wiki/Leveled-Compaction
  • ScyllaDB docs, "Size-tiered compaction strategy (STCS)" — the Cassandra heritage of size-tiered.

SplitMix64

  • Steele, Lea, Flood, "Fast Splittable Pseudorandom Number Generators", OOPSLA 2014. The mixing constants 0x9E37..., 0xBF58..., 0x94D0... come straight from this paper.

FNV-1a

  • Glenn Fowler, Landon Curt Noll, Phong Vo, "FNV non-cryptographic hash." The 64-bit offset 0xCBF29CE484222325 and prime 0x100000001B3 are the standard FNV-1a parameters.

Cross-Language Byte Equivalence as a Methodology

  • TigerBeetle's "Tigerstyle" notes on deterministic simulation.
  • FoundationDB's flow-based testing — the closest commercial analogue to "spec-by-hash-of-canonical-dump".

Analysis — db-21 Advanced Storage Engine

1. Problem Statement

Two engineering questions, studied in isolation:

  1. Range deletes. How does an LSM delete a key range [a, b) cheaply, without writing one Delete per key, and without losing correctness if a newer Put falls inside the same range?
  2. Compaction policy. How do size-tiered and Universal compaction actually differ — not as marketing words, but as deterministic functions over the current SST sequence?

The lab refuses to answer these with prose alone. It demands an executable specification that three language ports must agree on byte-for-byte.

2. Why Three Languages

Cross-language byte agreement is the cheapest sanity check that survives refactoring. If Rust drifts from Go on fixture A, the failure tells you exactly which side broke: the diff between the two dump_state() blobs is a structured binary, decodable by eye.

It also forces the design through three different idiom sets:

  • Rust keeps Option<Vec<u8>> for Get, enum Entry { Put, Delete } for the entry kind, and uses Vec<u8> everywhere for keys/values. Slices for bounds; no copies in merge_run's hot path.
  • Go uses []byte plus bytes.Compare. A map[string]struct{} stands in for the dedupe set. math.Float64bits for the ratio encoding.
  • C++ uses std::string as a byte container (avoids the char_traits trap), std::optional<std::string> for Get, and an inline 64-line SHA-256 in lsmctl.cc to keep the dependency surface at zero.

If you can read the same algorithm in all three and they line up at the byte level, the algorithm is unambiguous. That's the deliverable.

3. Range Tombstones — The Subtlety

The single non-obvious rule is:

A range tombstone hides keys older than it, but is itself hidden by Puts newer than it.

Both halves matter. Test range_tomb_respects_newer_put exists because a naive implementation that consults all tombstones before walking entries will silently drop the fresh value.

The implementation enforces this by walking SSTs newest → oldest and accumulating active tombstones as the walk descends. A Put in a newer SST is checked against the (then still empty) active set, so it survives. A Put in an older SST is checked against the (by then populated) active set, so it is hidden.

This also explains why a bloom miss must continue instead of return None: the SST we just skipped might have zero matching keys, but it could still contribute a range tombstone that shadows something below it. The active set must be allowed to grow.
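The walk described in this section can be sketched as follows. The struct layouts are simplified assumptions, and the bloom and bounds checks are reduced to a comment, since on a miss they only skip the entry scan.

```rust
#[derive(Clone, Debug)]
enum Entry {
    Put(Vec<u8>, Vec<u8>),
    Delete(Vec<u8>),
}

#[derive(Clone, Debug)]
struct RangeTomb {
    start: Vec<u8>,
    end: Vec<u8>,
}

impl RangeTomb {
    /// Half-open cover test: start <= key < end.
    fn covers(&self, key: &[u8]) -> bool {
        self.start.as_slice() <= key && key < self.end.as_slice()
    }
}

#[derive(Clone, Debug, Default)]
struct Sst {
    entries: Vec<Entry>,
    range_tombstones: Vec<RangeTomb>,
}

/// Point lookup over SSTs ordered newest first.
fn get(ssts: &[Sst], key: &[u8]) -> Option<Vec<u8>> {
    let mut active: Vec<&RangeTomb> = Vec::new();
    for sst in ssts {
        // This SST's tombstones join the active set before its entries are read.
        active.extend(sst.range_tombstones.iter());
        if active.iter().any(|t| t.covers(key)) {
            return None;
        }
        // A bloom or bounds miss would `continue` here, never return:
        // older SSTs may still add covering tombstones.
        for e in &sst.entries {
            match e {
                Entry::Put(k, v) if k.as_slice() == key => return Some(v.clone()),
                Entry::Delete(k) if k.as_slice() == key => return None,
                _ => {}
            }
        }
    }
    None
}

fn main() {
    let tomb = RangeTomb { start: b"k0".to_vec(), end: b"k9".to_vec() };
    let ssts = vec![
        Sst { entries: vec![Entry::Put(b"k5".to_vec(), b"new".to_vec())], range_tombstones: vec![] },
        Sst { entries: vec![Entry::Delete(b"k7".to_vec())], range_tombstones: vec![tomb] },
        Sst { entries: vec![Entry::Put(b"k3".to_vec(), b"old".to_vec())], range_tombstones: vec![] },
    ];
    println!("k5 = {:?}", get(&ssts, b"k5")); // newer Put survives the older tombstone
    println!("k3 = {:?}", get(&ssts, b"k3")); // older Put is hidden by it
}
```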

4. Size-Tiered vs Universal — The Real Difference

Both are "merge several SSTs into one". The difference is which several.

  • Size-tiered asks: "is there a prefix [0..L) of new, small SSTs that together fit within ratio · size(ssts[L])?" It picks the longest qualifying prefix, merges them, and inserts the result at the newest position. This is greedy from the top of the tree.

  • Universal asks the same shape of question, but over a contiguous run anywhere in the list, with a minimum run length of 3. It picks the longest run; ties go to the leftmost. The merged run replaces itself in place.

The minimum lengths (≥ 2 prefix for tiered, ≥ 3 run for universal) are deliberate, both to keep work amortised and to make the two policies distinguishable on small inputs. Without them, both would degenerate to "merge whenever you can" and the fixtures wouldn't separate them.
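Reducing each SST to its size makes the two selection rules easy to compare side by side. A sketch, with hypothetical function names; it only picks what to merge and does not perform the merge.

```rust
/// Size-tiered: longest prefix length L in [2, n-1] with
/// sum(sizes[0..L]) <= ratio * sizes[L]. None if no prefix qualifies.
fn pick_size_tiered(sizes: &[usize], ratio: f64) -> Option<usize> {
    let n = sizes.len();
    let mut best = None;
    let mut sum = 0usize;
    for l in 1..n {
        sum += sizes[l - 1]; // running sum of sizes[0..l]
        if l >= 2 && (sum as f64) <= ratio * sizes[l] as f64 {
            best = Some(l); // later, longer matches overwrite shorter ones
        }
    }
    best
}

/// Universal: longest run [i, i+L) with L >= 3 and
/// sum(run) <= ratio * sizes[i+L]; ties go to the smaller i.
/// Returns (start, len).
fn pick_universal(sizes: &[usize], ratio: f64) -> Option<(usize, usize)> {
    let n = sizes.len();
    let mut best: Option<(usize, usize)> = None;
    for i in 0..n {
        let mut sum = 0usize;
        for l in 1..(n - i) {
            sum += sizes[i + l - 1]; // running sum of sizes[i..i+l]
            let fits = (sum as f64) <= ratio * sizes[i + l] as f64;
            // Strictly longer only, so the leftmost run wins ties.
            if l >= 3 && fits && best.map_or(true, |(_, bl)| l > bl) {
                best = Some((i, l));
            }
        }
    }
    best
}

fn main() {
    // A large newest SST blocks every prefix, but a run in the middle qualifies.
    let sizes = [9, 1, 1, 1, 8];
    println!("size-tiered: {:?}", pick_size_tiered(&sizes, 0.5));
    println!("universal:   {:?}", pick_universal(&sizes, 0.5));
}
```

The input in `main` is one where the policies disagree: size-tiered finds nothing to do because every prefix includes the large newest SST, while universal merges the three small SSTs behind it.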

5. Why the Wire Format Looks Like That

Five choices, each with a reason:

  1. Magic "DSEADV21" (eight bytes, no length prefix). A wrong magic shows up in the first 16 hex characters of a hex dump of the blob, which is easy to spot.
  2. f64 ratio — encoded via the IEEE 754 bit pattern, not as a string. This is why all three languages route through f64::to_bits, math.Float64bits, and memcpy(&u64, &double, 8). Strings would force a formatter choice ("0.5" vs "0.5000000000000000").
  3. Length-prefixed keys/valuesu32 LE lengths, raw bytes. No terminator, no escaping. Decoding is a one-pass scan.
  4. Entries newest-SST-first — matches the in-memory layout; reversing it in the dump would obscure the actual data structure.
  5. Bloom as raw u64 LE — not a list of positions. The bitmap is the bloom; nothing else needs to be portable.

6. Trade-offs Not Taken

  • We did not implement snapshot reads. Every Get is "as of right now".
  • We did not deduplicate range tombstones across SSTs at merge time. A range that fully contains an older range still leaves both in the merged output. Real engines coalesce; we don't, because the canonical-bytes test would then depend on a chosen normalisation policy.
  • We did not gate compaction on actual work performed; size-tiered may pick a length-2 prefix even when the merge produces zero entries (after tombstones erase everything). That's a feature for the study lab — it exercises the merge code; in production you'd skip empty merges.

Execution — db-21 Wire Format and Workload

This document is the single source of truth for the canonical wire format and the deterministic workload. Anything ambiguous here is a bug; fix the doc, not the implementations.

1. SplitMix64 PRNG

state += 0x9E3779B97F4A7C15
z      = state
z      = (z XOR (z >> 30)) * 0xBF58476D1CE4E5B9
z      = (z XOR (z >> 27)) * 0x94D049BB133111EB
return   z XOR (z >> 31)

All multiplications are unsigned 64-bit, wrapping on overflow. The PRNG is seeded with the user-supplied 64-bit seed. Three draws happen per op, even when only one or two are used — keep them in order (r1, r2, r3).

2. Operation Selection

op = (r1 >> 62) & 0b11
| op | Action |
|---|---|
| 0, 1 | Put(k = "k" + (r2 mod keys), v = u32_be(r3 as u32)) |
| 2 | Delete(k = "k" + (r2 mod keys)) |
| 3 | RangeTomb(start = "k" + a, end = "k" + (a + 1 + (r3 mod (keys-a)))) where a = r2 mod keys |

In scenario ptonly, op 3 is rewritten to op 0 before the action runs. The three draws still happen.

The value bytes are the big-endian 32-bit representation of r3 truncated to 32 bits. (Big-endian because it produces visually distinct bytes across fixtures; the format is otherwise little-endian.)
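The selection rules above translate almost line for line into a decoder. In this sketch the `Action` type and function names are hypothetical, and rendering the key number as unpadded decimal after "k" is an assumption; the spec only says "k" + number.

```rust
#[derive(Debug, PartialEq)]
enum Action {
    Put { key: String, value: [u8; 4] },
    Delete { key: String },
    RangeTomb { start: String, end: String },
}

/// ASSUMPTION: keys render as "k" followed by the unpadded decimal number.
fn key_name(i: u64) -> String {
    format!("k{}", i)
}

/// Map one (r1, r2, r3) triple to an action. `keys` must be non-zero.
fn decode_op(r1: u64, r2: u64, r3: u64, keys: u64, ptonly: bool) -> Action {
    let mut op = (r1 >> 62) & 0b11;
    if ptonly && op == 3 {
        op = 0; // scenario ptonly rewrites range tombstones to Puts
    }
    let a = r2 % keys;
    match op {
        // Value: r3 truncated to 32 bits, big-endian.
        0 | 1 => Action::Put { key: key_name(a), value: (r3 as u32).to_be_bytes() },
        2 => Action::Delete { key: key_name(a) },
        // keys - a >= 1 because a < keys, so the modulus is never zero
        // and the interval [start, end) is never empty.
        _ => Action::RangeTomb {
            start: key_name(a),
            end: key_name(a + 1 + (r3 % (keys - a))),
        },
    }
}

fn main() {
    println!("{:?}", decode_op(3u64 << 62, 1, 6, 4, false));
    println!("{:?}", decode_op(3u64 << 62, 1, 6, 4, true));
}
```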

3. Flush and Compact Cadence

  • Every 8 ops (i.e. when (op_idx + 1) % 8 == 0): flush all pending entries and tombstones into a new SST at the newest position.
  • Every 16 ops (i.e. when (op_idx + 1) % 16 == 0): run one compaction pass appropriate to the scenario (size-tiered, universal, or no-op).
  • No residue flush at end. If the loop ends with non-zero pending entries, they are discarded. This is intentional: it keeps the cross-language hashes stable regardless of ops mod 8.
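The cadence rules can be checked with a counter-only simulation. This sketch tracks no data, only how many flushes and compaction passes fire and how many trailing ops are discarded. Running the flush before the compaction pass when both fire on the same op is an assumption; the rules above do not state the order.

```rust
#[derive(Debug, Default, PartialEq)]
struct Cadence {
    flushes: u64,
    compactions: u64,
    discarded_ops: u64,
}

/// Count flushes, compaction passes, and discarded trailing ops for a
/// workload of `ops` operations.
fn simulate_cadence(ops: u64) -> Cadence {
    let mut c = Cadence::default();
    let mut pending = 0u64;
    for op_idx in 0..ops {
        pending += 1; // the op lands in the pending buffer
        if (op_idx + 1) % 8 == 0 {
            c.flushes += 1; // pending entries and tombstones become a new SST
            pending = 0;
        }
        if (op_idx + 1) % 16 == 0 {
            c.compactions += 1; // one pass of the scenario's policy
        }
    }
    // No residue flush: whatever is still pending is dropped.
    c.discarded_ops = pending;
    c
}

fn main() {
    for ops in vec![200u64, 500, 300] {
        println!("ops={} -> {:?}", ops, simulate_cadence(ops));
    }
}
```

For the three fixture sizes used in this lab, only ops = 200 is a multiple of 8; the other two end with four ops that never reach an SST.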

4. Canonical Wire Format

All integers little-endian. lenpref(b) means u32 LE len(b) ‖ b.

"DSEADV21"               (8 bytes, ASCII, no terminator)
f64 LE ratio             (IEEE 754 bit pattern, not a string)
u32 LE sst_count
for each SST (newest first):
    lenpref(smallest_key)
    lenpref(largest_key)
    u32 LE entry_count
    for each entry:
        u8 kind                 (Put = 1, Delete = 2)
        lenpref(key)
        if kind == Put: lenpref(value)
    u32 LE range_tomb_count
    for each range tombstone:
        lenpref(start_key)
        lenpref(end_key)
    u64 LE bloom_bitmap
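
One way to write an encoder for this layout in Rust. The struct and field names here are illustrative assumptions, not the lab's actual types; only the byte layout comes from the listing above.

```rust
pub enum Entry {
    Put { key: Vec<u8>, value: Vec<u8> },
    Delete { key: Vec<u8> },
}

pub struct RangeTomb {
    pub start: Vec<u8>,
    pub end: Vec<u8>,
}

pub struct Sst {
    pub smallest: Vec<u8>,
    pub largest: Vec<u8>,
    pub entries: Vec<Entry>,
    pub range_tombs: Vec<RangeTomb>,
    pub bloom: u64,
}

/// lenpref(b) = u32 LE len(b) followed by b. Never write usize directly.
fn lenpref(out: &mut Vec<u8>, b: &[u8]) {
    out.extend_from_slice(&(b.len() as u32).to_le_bytes());
    out.extend_from_slice(b);
}

/// `ssts` must already be ordered newest first.
pub fn dump_state(ratio: f64, ssts: &[Sst]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(b"DSEADV21");
    // IEEE 754 bit pattern, not a decimal string.
    out.extend_from_slice(&ratio.to_bits().to_le_bytes());
    out.extend_from_slice(&(ssts.len() as u32).to_le_bytes());
    for sst in ssts {
        lenpref(&mut out, &sst.smallest);
        lenpref(&mut out, &sst.largest);
        out.extend_from_slice(&(sst.entries.len() as u32).to_le_bytes());
        for entry in &sst.entries {
            match entry {
                Entry::Put { key, value } => {
                    out.push(1);
                    lenpref(&mut out, key);
                    lenpref(&mut out, value);
                }
                Entry::Delete { key } => {
                    out.push(2);
                    lenpref(&mut out, key);
                }
            }
        }
        out.extend_from_slice(&(sst.range_tombs.len() as u32).to_le_bytes());
        for tomb in &sst.range_tombs {
            lenpref(&mut out, &tomb.start);
            lenpref(&mut out, &tomb.end);
        }
        out.extend_from_slice(&sst.bloom.to_le_bytes());
    }
    out
}
```

An empty tree encodes to exactly 20 bytes: 8 of magic, 8 of ratio, 4 of sst_count.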

5. The Three Canonical Fixtures

Captured from the Rust reference and pinned in scripts/cross_test.sh:

| Fixture | seed | ops | keys | scenario | sha256 of dump |
| --- | --- | --- | --- | --- | --- |
| A | 42 | 200 | 32 | tieredcompact | fc2fe88978eb2d419a73a7a16fa9ec0695ad9a56cb3a31b0bf85c0a28d7c97d6 |
| B | 7 | 500 | 64 | universalcompact | 05b07426e0da8ec2f1f8c81573dc275cd61cab9c19c93dc17c854456e441e7bb |
| C | 99 | 300 | 16 | withrange | 4ad255755dbfbaa40a842766656d0c0dbd6713b6a527ffea5a24fa35964d73e4 |

If you change anything about the workload or the wire format, these hashes change. That's the contract: the hashes are intentional padlocks on behavioural drift.

6. lsmctl CLI

lsmctl workload --seed S --ops N --keys K --scenario {ptonly|withrange|tieredcompact|universalcompact}

Prints the lowercase hex sha256 of dump_state() followed by a newline. Exit code is 0 on success, 2 on argument errors. All three ports must agree on stdout byte-for-byte for the same arguments.

7. Reproducing the Hashes

cd db-21-storage-engine-advanced
./scripts/verify.sh     # all unit tests
./scripts/cross_test.sh # cross-language byte equivalence

Expected last line: === ALL OK ===.

Observation — db-21

1. The Three Hashes

A  seed=42  ops=200  keys=32  tieredcompact     fc2fe88978eb2d419a73a7a16fa9ec0695ad9a56cb3a31b0bf85c0a28d7c97d6
B  seed=7   ops=500  keys=64  universalcompact  05b07426e0da8ec2f1f8c81573dc275cd61cab9c19c93dc17c854456e441e7bb
C  seed=99  ops=300  keys=16  withrange         4ad255755dbfbaa40a842766656d0c0dbd6713b6a527ffea5a24fa35964d73e4

All three languages produce all three hashes on the first run after each clean build. This was not a happy accident — it required keeping every sneaky source of nondeterminism out of the merge step:

  • HashSet iteration order doesn't leak (we sort out_entries by key after the merge, and we never serialise the seen set).
  • Map ordering doesn't leak (Go uses a map[string]struct{} for dedupe but never iterates it; entries come out of a slice).
  • Floating-point comparison doesn't leak (the ratio is 0.5 exactly, which is a representable f64; Σ size ≤ ratio · size is integer-vs-rational with no rounding ambiguity at this scale).

2. What Bit Us During Development

  1. Two-pass size-tiered. An early draft computed prefix_sum once to pick chosen, then recomputed it inside the merge call. The two passes drifted under refactoring. Fixed by collapsing to a single pass that updates prefix_sum inline.

  2. Go math.Float64bits. Initial Go draft tried to avoid the math import by writing a wrapper chain (float64bits → float64bitsFallback → math_Float64bits). The chain was broken (no math import to define the leaf). Lesson: don't fight the standard library for ceremony.

  3. C++ std::optional<std::string> for Get. Worth the friction versus a sentinel value: a Put of the empty string is distinguishable from absent, which is testable in dedup_keeps_last.

3. What We Didn't Observe (and why that's good)

  • No platform endianness surprises. macOS arm64 produced the same hashes the canonical fixtures pin. The explicit LE encoding in every put-int helper means we'd survive a big-endian port too.
  • No f64 rounding drift. The ratio is 0.5 and the sizes are small integers; nothing forces denormals or transcendental math.
  • No SHA-256 mismatch. The Rust port uses an inline impl in lsmctl.rs; the Go port uses crypto/sha256; the C++ port uses the 64-line public-domain reference at the bottom of adv.cc. Three independent SHA-256 implementations agreeing on three hashes is the cheapest possible end-to-end test.

4. Resource Profile

Each cargo build --release takes ~5s cold. go build ~1s. cmake --build ~3s. cross_test.sh from cold runs in ~10s including all three builds. No external network, no Docker, no system packages beyond a working C++20 toolchain, Go ≥ 1.22, and Rust stable.

Verification — db-21

1. What "Verified" Means Here

Two distinct claims:

  1. Per-language correctness: unit tests in each language pass.
  2. Cross-language byte equivalence: three independent implementations produce identical canonical wire dumps for three fixed workloads, proven by sha256.

Both must hold. (1) without (2) lets each port drift independently into a "self-consistent but wrong" state.

2. Per-Language Unit Tests

Ten tests, mirrored across all three ports:

| # | Name | Asserts |
| --- | --- | --- |
| 1 | bloom_hit_miss | Bloom positive case + a definite negative |
| 2 | bounds_short_circuit | Get skips SST when key outside [smallest, largest] |
| 3 | range_tomb_hides_older_put | Newer range tomb shadows older Put |
| 4 | range_tomb_respects_newer_put | Older range tomb does not shadow newer Put |
| 5 | tiered_picks_prefix | compact_size_tiered picks ≥2 prefix |
| 6 | universal_picks_run | compact_universal picks ≥3 contiguous run |
| 7 | noop_compaction | Returns false when no eligible group |
| 8 | dump_determinism | Two dumps of the same state are equal; magic is DSEADV21 |
| 9 | workload_all_scenarios | All four scenarios produce non-empty dumps with correct magic |
| 10 | dedup_keeps_last | build_sst keeps the last Put per key |

./scripts/verify.sh
# == Rust ==
# 10 passed; 0 failed
# == Go ==
# ok      github.com/10xdev/dse/db21
# == C++ ==
# 1/1 Test #1: test_adv .........................   Passed
# === OK ===

3. Cross-Language Byte Equivalence

./scripts/cross_test.sh
# == build Rust ==
# == build Go ==
# == build C++ ==
# ok   fixture=A impl=rust fc2fe88978eb2d419a73a7a16fa9ec0695ad9a56cb3a31b0bf85c0a28d7c97d6
# ok   fixture=A impl=go   fc2fe88978eb2d419a73a7a16fa9ec0695ad9a56cb3a31b0bf85c0a28d7c97d6
# ok   fixture=A impl=cpp  fc2fe88978eb2d419a73a7a16fa9ec0695ad9a56cb3a31b0bf85c0a28d7c97d6
# ok   fixture=B impl=rust 05b07426e0da8ec2f1f8c81573dc275cd61cab9c19c93dc17c854456e441e7bb
# ok   fixture=B impl=go   05b07426e0da8ec2f1f8c81573dc275cd61cab9c19c93dc17c854456e441e7bb
# ok   fixture=B impl=cpp  05b07426e0da8ec2f1f8c81573dc275cd61cab9c19c93dc17c854456e441e7bb
# ok   fixture=C impl=rust 4ad255755dbfbaa40a842766656d0c0dbd6713b6a527ffea5a24fa35964d73e4
# ok   fixture=C impl=go   4ad255755dbfbaa40a842766656d0c0dbd6713b6a527ffea5a24fa35964d73e4
# ok   fixture=C impl=cpp  4ad255755dbfbaa40a842766656d0c0dbd6713b6a527ffea5a24fa35964d73e4
# === ALL OK ===

4. What Would Falsify The Claim

A non-exhaustive list of bugs the cross test would catch but a per-language test wouldn't:

  • Forgetting to encode the bloom bitmap as little-endian on a big-endian port.
  • Using host integer width for length prefixes instead of u32.
  • Iterating a hash map at any point in merge_run (non-deterministic order across languages and across runs).
  • Encoding the ratio as "0.5" instead of the IEEE bit pattern.
  • Compacting via "longest run found so far that satisfies threshold at the time of finding", instead of evaluating all runs and picking the global longest.
  • Off-by-one in b = a + 1 + (r3 mod (keys-a)) for the range tombstone end key.

5. Reproducibility Bar

  • macOS arm64, AppleClang 16, Go 1.22, Rust stable (rustc 1.7x).
  • No external dependencies (no sha2 crate, no golang.org/x/..., no OpenSSL): every implementation is self-contained, so the verification step is reproducible offline.
  • All three hashes are pinned in scripts/cross_test.sh and reproduced in this document for paper-trail purposes.

Broader Ideas — db-21

A short scrapbook of "what would I add next if this were a real engine?"

1. Tombstone Garbage Collection

Right now a range tombstone lives forever: it survives every compaction and is copied verbatim into the merged output. A production engine drops a tomb once it is certain no shadowed Puts remain below it. A sufficient test: no older SST can hold a key in [start_key, end_key), i.e. for every SST below it either end_key ≤ smallest_key (the end is exclusive) or start_key > largest_key. Implementing this would require tracking the "sequence number" or generation of each record, which we deliberately omitted.

2. Coalescing Overlapping Tombstones

Two tombs [k0, k5) and [k3, k7) are equivalent to [k0, k7). Merging them at compaction time shrinks the per-Get cost (the active vector stays smaller). We didn't do it because the canonical-bytes test would then need to specify a normalisation policy (sort by start? coalesce overlaps? coalesce adjacencies?). Each choice is fine, but the choice itself becomes part of the wire format.

3. Multi-Level Layout

The lab keeps SSTs as a flat list. RocksDB has L0 (overlapping ranges allowed, newest writes land here) plus L1..Ln (each level non-overlapping, ratio'd in size). Universal compaction roughly corresponds to a degenerate "L0 only" mode, while leveled compaction is its own beast (each compaction picks one L_i SST and the L_{i+1} SSTs that overlap it). A natural follow-up would implement leveled compaction and re-run the same three fixtures with new hashes.

4. Bloom Quality

A 64-bit single-hash bloom is intentionally bad — it exists to make the test for "bloom misses still must walk older SSTs for range tombstones" trigger reliably on tiny inputs. Real engines use per-SST blooms sized to ~10 bits per key with k≈7 hash functions, giving a false-positive rate ~1%. The change is purely numeric; the wire format would absorb a longer bitmap as a length-prefixed byte string.
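The ~1% figure follows from the standard approximation p ≈ (1 − e^(−k·n/m))^k, where m/n is bits per key and k is the number of hash functions. A small sketch that evaluates it (the function name is illustrative):

```rust
/// Approximate bloom false-positive rate for `bits_per_key` = m/n
/// and `hashes` = k, using p ≈ (1 - e^(-k / bits_per_key))^k.
pub fn bloom_fp_rate(bits_per_key: f64, hashes: u32) -> f64 {
    let k = f64::from(hashes);
    (1.0 - (-k / bits_per_key).exp()).powf(k)
}
```

With 10 bits per key and k = 7 this evaluates to about 0.0082, i.e. a little under 1%. For comparison, a single-hash 64-bit bitmap holding 8 keys (8 bits per key, k = 1) comes out near 12%, which is why the lab's bloom produces false positives so readily on tiny inputs.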

5. Snapshot Reads / MVCC

If each entry carried a seq: u64, Get(key, at_seq) would walk SSTs the same way but only consider entries with entry.seq ≤ at_seq. Range tombstones would gain a seq too. The merge step would need to keep older versions until they're below the oldest live snapshot.

6. Why Not Implement These Now?

Each one would multiply the size of the wire format and the surface area of the cross-language tests. The lab's claim is that two ideas (range tombstones, two compaction policies) are enough to stress-test cross-language byte equivalence to the point of being convincing. Adding a third without first writing it down somewhere else would dilute the lesson.

Step 01 — Range Tombstones

Goal

Implement a single record that logically deletes every key in [start, end) without writing one Delete per key, and prove the priority rules with two adversarial tests.

What to Build

  • A RangeTomb { start_key, end_key } value type with a covers(key) predicate (key >= start && key < end).
  • An Sst carries a Vec<RangeTomb> alongside its Vec<Entry>.
  • LsmTreeAdvanced::get walks SSTs newest → oldest, accumulating active tombstones into a local vector as it goes.

The Two Rules That Matter

  1. A range tombstone hides keys older than it — i.e. in SSTs that appear later in the newest-first walk.
  2. A range tombstone does not hide keys newer than it — i.e. in SSTs that appear earlier in the walk.

The Two Tests That Pin Them

  • range_tomb_hides_older_put: newer SST has tomb [k0, k5), older SST has Put(k3, "hello"). get("k3") must return None.
  • range_tomb_respects_newer_put: newest SST has Put(k3, "fresh"), middle SST has tomb [k0, k5), oldest SST has Put(k3, "stale"). get("k3") must return Some("fresh").

Subtlety: Bloom Misses

When the bloom of an SST says "key not here", you cannot return early from get. The skipped SST might contribute a range tombstone that would shadow something below. So a bloom miss skips only that SST's entry lookup: its range tombstones still join the active set, and a covering tombstone ends the walk with None.
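A minimal sketch of that walk in Rust. It is not the lab's `LsmTreeAdvanced::get`: the types are simplified (an entry is a key plus `Option<String>`, with `None` meaning Delete), the bounds check is omitted, `bloom_bit` is a made-up single-hash function, and it assumes entries in an SST are checked before that SST's own tombstones are activated, mirroring the merge order described in Step 02.

```rust
pub struct RangeTomb {
    pub start: String,
    pub end: String,
}

impl RangeTomb {
    /// Half-open interval: start <= key < end.
    pub fn covers(&self, key: &str) -> bool {
        key >= self.start.as_str() && key < self.end.as_str()
    }
}

pub struct Sst {
    /// (key, Some(value)) is a Put; (key, None) is a Delete.
    pub entries: Vec<(String, Option<String>)>,
    pub range_tombs: Vec<RangeTomb>,
    pub bloom: u64,
}

/// Hypothetical single-hash bloom bit; not the lab's hash function.
fn bloom_bit(key: &str) -> u64 {
    let h = key
        .bytes()
        .fold(0u64, |h, b| h.wrapping_mul(31).wrapping_add(u64::from(b)));
    1u64 << (h % 64)
}

impl Sst {
    pub fn new(entries: Vec<(String, Option<String>)>, range_tombs: Vec<RangeTomb>) -> Self {
        let bloom = entries.iter().fold(0u64, |b, e| b | bloom_bit(&e.0));
        Self { entries, range_tombs, bloom }
    }
}

/// `ssts` is ordered newest first.
pub fn get(ssts: &[Sst], key: &str) -> Option<String> {
    let mut active: Vec<&RangeTomb> = Vec::new();
    for sst in ssts {
        // A bloom miss skips the entry lookup only, never the tombstones.
        if (sst.bloom & bloom_bit(key)) != 0 {
            if let Some(hit) = sst.entries.iter().find(|e| e.0 == key) {
                return hit.1.clone(); // Put -> Some(value), Delete -> None
            }
        }
        active.extend(sst.range_tombs.iter());
        if active.iter().any(|t| t.covers(key)) {
            return None; // everything older is shadowed
        }
    }
    None
}
```

The two adversarial tests map directly onto this: with the tomb in the newer SST the walk returns `None` before reaching the older Put, and with the Put in the newest SST the walk returns before the tomb is ever activated.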

Done When

  • Both tests above pass in all three languages.
  • The range_tombstones are present in dump_state per Section 4 of docs/execution.md, and the three canonical fixtures still match.

Step 02 — Tiered and Universal Compaction

Goal

Implement two compaction policies as deterministic functions of the current SST sequence and the configured ratio, so that the same input list of SSTs always produces the same output list.

Size-Tiered

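Pick the largest prefix length L ∈ [2, n-1] such that Σ size(ssts[0..L]) ≤ ratio · size(ssts[L]), where ssts[0..L] is the half-open range of the L newest SSTs and ssts[L] is the SST immediately after them.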

chosen     = 0
prefix_sum = 0
for L in 1..=n-1:
    prefix_sum += size(ssts[L-1])
    if L >= 2 and prefix_sum <= ratio * size(ssts[L]):
        chosen = L
if chosen < 2: return false
merged = merge_run(ssts[0..chosen])
ssts   = [merged] ++ ssts[chosen..]
return true

The merged SST goes at the newest position (index 0), because it replaces the prefix of newest SSTs and therefore still holds the newest data in the tree.
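The selection half of the pseudocode, sketched in Rust over a plain list of sizes. The function name `pick_size_tiered` and the `Option<usize>` return are illustrative; the merge and list splice are left out.

```rust
/// Returns the chosen prefix length L (>= 2), or None when nothing qualifies.
/// `sizes` is ordered newest first.
pub fn pick_size_tiered(sizes: &[u64], ratio: f64) -> Option<usize> {
    let n = sizes.len();
    let mut chosen = 0usize;
    let mut prefix_sum = 0u64;
    // L runs over 1..=n-1; the range is empty when n < 2.
    for l in 1..n {
        prefix_sum += sizes[l - 1];
        if l >= 2 && (prefix_sum as f64) <= ratio * (sizes[l] as f64) {
            chosen = l; // later hits overwrite earlier ones: longest wins
        }
    }
    if chosen >= 2 {
        Some(chosen)
    } else {
        None
    }
}
```

With sizes `[1, 1, 1, 10]` and ratio 0.5 the result is `Some(3)`: the three small SSTs sum to 3, which is at most 0.5 · 10. The single pass that updates `prefix_sum` inline is the shape the two-pass bug described in the Observation notes was collapsed into.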

Universal

Pick the longest contiguous run [i, i+L) with L ≥ 3 such that Σ size(run) ≤ ratio · size(ssts[i+L]) (i.e. the run must be followed by something at least 1/ratio times its total size). Ties broken by smaller i.

best_i, best_L = -1, 0
for i in 0..n:
    if i + 3 >= n: break
    run_sum = 0
    for L in 1..=n-1-i:
        run_sum += size(ssts[i+L-1])
        follow = i + L
        if follow >= n: break
        if L >= 3 and run_sum <= ratio * size(ssts[follow]):
            if L > best_L: best_i, best_L = i, L
if best_L == 0: return false
merged = merge_run(ssts[best_i..best_i+best_L])
ssts   = ssts[..best_i] ++ [merged] ++ ssts[best_i+best_L..]
return true
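
The same selection in Rust, again over a list of sizes. `pick_universal` is an illustrative name; it returns the start index and length of the chosen run and leaves the merge and splice to the caller.

```rust
/// Returns (i, L) for the longest qualifying run, ties going to smaller i.
/// `sizes` is ordered newest first.
pub fn pick_universal(sizes: &[u64], ratio: f64) -> Option<(usize, usize)> {
    let n = sizes.len();
    let mut best: Option<(usize, usize)> = None;
    for i in 0..n {
        if i + 3 >= n {
            break; // no room for a run of 3 plus a follower
        }
        let mut run_sum = 0u64;
        // L runs over 1..=n-1-i, so the follower index i + L stays < n.
        for l in 1..(n - i) {
            run_sum += sizes[i + l - 1];
            let follow = i + l;
            if l >= 3 && (run_sum as f64) <= ratio * (sizes[follow] as f64) {
                // Strict '>' keeps the earlier (smaller i) run on ties.
                let longer = match best {
                    Some((_, best_l)) => l > best_l,
                    None => true,
                };
                if longer {
                    best = Some((i, l));
                }
            }
        }
    }
    best
}
```

For sizes `[100, 1, 1, 1, 100]` and ratio 0.5 this yields `Some((1, 3))`, the three small SSTs between the two large ones. Every run is evaluated before a winner is kept, which is the "global longest" behaviour the verification notes call out.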

Merge Semantics (shared by both)

Walk the run newest → oldest:

  1. Append the SST's range tombs to out_tombs.
  2. For each entry:
    • skip if seen[key] (newer-wins),
    • skip if covered by any tomb in active,
    • otherwise emit; mark seen.
  3. Append the SST's range tombs to active (so they apply to older SSTs in the run).

After the walk, sort out_entries by key (for determinism across hash-map iteration orders) and recompute smallest, largest, bloom.
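A sketch of the walk in Rust, with simplified types (an entry's `value` is `None` for a Delete). Recomputing smallest, largest, and bloom is omitted, and the names are illustrative rather than taken from the lab's source.

```rust
use std::collections::HashSet;

#[derive(Clone, Debug, PartialEq)]
pub struct Entry {
    pub key: String,
    pub value: Option<String>, // None = Delete
}

#[derive(Clone, Debug, PartialEq)]
pub struct RangeTomb {
    pub start: String,
    pub end: String,
}

impl RangeTomb {
    fn covers(&self, key: &str) -> bool {
        key >= self.start.as_str() && key < self.end.as_str()
    }
}

pub struct Sst {
    pub entries: Vec<Entry>,
    pub range_tombs: Vec<RangeTomb>,
}

/// `run` is ordered newest first.
pub fn merge_run(run: &[Sst]) -> Sst {
    let mut seen: HashSet<String> = HashSet::new(); // membership only, never iterated
    let mut active: Vec<RangeTomb> = Vec::new();
    let mut out_entries: Vec<Entry> = Vec::new();
    let mut out_tombs: Vec<RangeTomb> = Vec::new();
    for sst in run {
        // Step 1: every tombstone survives into the output.
        out_tombs.extend(sst.range_tombs.iter().cloned());
        // Step 2: newer-wins, then shadowing by tombs from strictly newer SSTs.
        for e in &sst.entries {
            if seen.contains(&e.key) {
                continue;
            }
            if active.iter().any(|t| t.covers(&e.key)) {
                continue;
            }
            seen.insert(e.key.clone());
            out_entries.push(e.clone());
        }
        // Step 3: this SST's tombs now apply to older SSTs in the run.
        active.extend(sst.range_tombs.iter().cloned());
    }
    // Sort so the output order never depends on insertion or hash order.
    out_entries.sort_by(|a, b| a.key.cmp(&b.key));
    Sst { entries: out_entries, range_tombs: out_tombs }
}
```

The `HashSet` is safe here because it is only queried with `contains`; the output comes from a `Vec` that is sorted before it is returned.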

Why the Minimum Lengths

  • Tiered's L ≥ 2 keeps it from being "merge one SST with nothing", which would just rewrite the SST.
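  • Universal's L ≥ 3 is this lab's choice of minimum merge width (RocksDB exposes the same knob as min_merge_width, whose default is 2); shorter runs qualify too often for the merge to amortise its I/O.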

Done When

  • tiered_picks_prefix passes (size-tiered selects the 3-small-SST prefix in front of a big SST and produces a 2-SST result).
  • universal_picks_run passes (universal selects the 3-small run between two big SSTs).
  • noop_compaction passes (both policies return false on a 2-SST tree).
  • All three canonical fixtures still match after this step.

Step 03 — Cross-Language Byte Equivalence

Goal

Prove that the Rust, Go, and C++ implementations produce byte-identical canonical wire dumps for three fixed workloads.

Why This Is The Whole Point

API-level test parity is cheap and weak. "Same input → same hash of a canonical binary dump" is strong: any per-language drift (endian, integer width, map-iteration order, float formatting) surfaces as a hash mismatch on the next run.

The Format (one canonical source)

See docs/execution.md Section 4. Two-line summary:

  • Magic "DSEADV21" ‖ f64 LE ratio ‖ u32 LE sst_count.
  • Per SST (newest first): bounds (lenpref) ‖ entries (u8 kind + lenpref key + maybe lenpref value) ‖ range tombs ‖ u64 LE bloom bitmap.

The Workload (one canonical source)

See docs/execution.md Sections 1-3. Two-line summary:

  • SplitMix64 PRNG, 3 draws per op; (r1 >> 62) & 3 chooses Put / Put / Delete / RangeTomb.
  • Flush every 8 ops, compact every 16. No residue flush at end.

The Three Fixtures

| Fixture | seed | ops | keys | scenario |
| --- | --- | --- | --- | --- |
| A | 42 | 200 | 32 | tieredcompact |
| B | 7 | 500 | 64 | universalcompact |
| C | 99 | 300 | 16 | withrange |

Hashes are pinned in scripts/cross_test.sh and reproduced in docs/execution.md Section 5 and docs/verification.md Section 3.

Done When

./scripts/cross_test.sh
# ... ends with ...
=== ALL OK ===

If it doesn't, the diff between two implementations' dumps is the debugging artefact. Decode the first ~16 bytes to confirm magic + ratio, then walk SSTs one at a time — each SST is self-delimiting.

What To Do When A Hash Drifts

  1. Recapture from Rust. If you intentionally changed semantics, the Rust reference dictates the new canonical hashes; update both scripts/cross_test.sh and docs/execution.md Section 5.
  2. Hunt the drift. If you didn't intend to change anything, diff the raw dump_state bytes between the failing pair. The first differing byte tells you where in the format the bug lives. Common culprits: forgot LE, used usize instead of u32, iterated a hash map.
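
Finding the first differing byte is easy to script. The sketch below assumes you have already captured two raw dumps as byte slices; `first_diff` and `header_field` are hypothetical helpers, and the header offsets follow the format in docs/execution.md Section 4 (8 bytes magic, 8 bytes ratio, 4 bytes sst_count).

```rust
/// Index of the first byte where the dumps differ, or None if identical.
/// If one dump is a strict prefix of the other, the shorter length is returned.
pub fn first_diff(a: &[u8], b: &[u8]) -> Option<usize> {
    match a.iter().zip(b.iter()).position(|(x, y)| x != y) {
        Some(i) => Some(i),
        None if a.len() != b.len() => Some(a.len().min(b.len())),
        None => None,
    }
}

/// Names the fixed header field an offset falls in; anything past
/// byte 19 is inside the variable-length SST section.
pub fn header_field(offset: usize) -> &'static str {
    match offset {
        0..=7 => "magic",
        8..=15 => "ratio",
        16..=19 => "sst_count",
        _ => "sst body",
    }
}
```

A difference inside the first 20 bytes points at the header encoding; anything later requires walking the SSTs from offset 20 to find which record contains the offset.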

db-22 — Performance and Benchmarking

Why this lab exists

Benchmarks lie. They mostly lie because a benchmark answers a different question than the one you thought you were asking. db-22 is a small, self-contained system whose only purpose is to be measured: a keyed in-memory counter store driven by a deterministic synthetic workload. We freeze a wire format and a workload, hash the resulting state across three implementations (Rust, Go, C++), and use the resulting binary identity as the load-bearing definition of "the same program."

Once correctness is cross-language identical, we can compare throughput of the three implementations on the same hardware honestly — and we can talk about what does and does not constitute a fair comparison.

The data structure: CounterStore

A CounterStore is a mapping i64 -> u64 (counters) plus a single total_ops: u64 running counter. Three operations:

  • incr(k, by): total_ops += 1, counters[k] += by (entry created with value by if missing).
  • decr(k, by): total_ops += 1. If k is missing the call has no further effect (total_ops was already incremented). Otherwise:
    • if current <= by, the entry is removed (saturating decrement);
    • else counters[k] = current - by.
  • get(k) -> Option<u64>: live lookup, returns None if absent.

There are no tombstones. Removed counters leave no trace in the snapshot. This is intentional and matters: it makes the snapshot a pure function of the final live state, not of the history of operations.

The semantic that total_ops is bumped on every call (including no-op decr on missing) is the simplest invariant and is the one against which all golden hashes were computed. Changing it would change every hash.
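The three operations can be sketched in Rust as below. This is a minimal illustration of the semantics just described, not the lab's source; overflow behaviour of `incr` is not specified in this document, so the sketch uses plain addition and relies on the workload keeping `by` small.

```rust
use std::collections::BTreeMap;

#[derive(Default)]
pub struct CounterStore {
    counters: BTreeMap<i64, u64>,
    total_ops: u64,
}

impl CounterStore {
    pub fn incr(&mut self, k: i64, by: u64) {
        self.total_ops += 1;
        // Missing entries start at 0, so the new value is `by`.
        *self.counters.entry(k).or_insert(0) += by;
    }

    pub fn decr(&mut self, k: i64, by: u64) {
        // Bumped on every call, including a decr of a missing key.
        self.total_ops += 1;
        if let Some(current) = self.counters.get(&k).copied() {
            if current <= by {
                self.counters.remove(&k); // saturating: the entry disappears
            } else {
                self.counters.insert(k, current - by);
            }
        }
    }

    pub fn get(&self, k: i64) -> Option<u64> {
        self.counters.get(&k).copied()
    }

    pub fn total_ops(&self) -> u64 {
        self.total_ops
    }
}
```

Note that a decrement that lands exactly on zero removes the entry, so a live counter is never 0 unless it was created by `incr(k, 0)`.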

Wire format: dump_snapshot

The snapshot is a function CounterStore -> bytes whose output must be byte-identical across all three implementations.

offset  size  field
------  ----  ---------------------------------------------
0       8     magic "DSEBENCH" (ASCII)
8       8     total_ops (u64 little-endian)
16      4     distinct_keys (u32 little-endian)
20+     16    one row per key, ascending by key:
              - 8 bytes: key (i64 little-endian)
              - 8 bytes: count (u64 little-endian)

Ordering is the only subtle bit. Rust uses BTreeMap, whose iteration is naturally ascending. Go uses a plain map and explicitly sorts the keys before emitting. C++ uses std::map, also ascending. All three converge on the same byte sequence.
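A sketch of the encoder in Rust, taking the two pieces of state directly rather than the lab's store type (the signature is an assumption made to keep the example self-contained):

```rust
use std::collections::BTreeMap;

pub fn dump_snapshot(total_ops: u64, counters: &BTreeMap<i64, u64>) -> Vec<u8> {
    let mut out = Vec::with_capacity(20 + 16 * counters.len());
    out.extend_from_slice(b"DSEBENCH");
    out.extend_from_slice(&total_ops.to_le_bytes());
    // The count goes on the wire as u32, never as usize.
    out.extend_from_slice(&(counters.len() as u32).to_le_bytes());
    // BTreeMap iterates in ascending key order, so no explicit sort is needed.
    for (key, count) in counters {
        out.extend_from_slice(&key.to_le_bytes());
        out.extend_from_slice(&count.to_le_bytes());
    }
    out
}
```

The output length is always 20 + 16 · distinct_keys bytes, which makes a quick sanity check before hashing. Ascending order here is the signed i64 order; the workload only produces non-negative keys, so the distinction does not show up in the frozen scenarios.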

The workload: deterministic by construction

We need a workload that:

  1. Is identical across languages.
  2. Exercises a mix of insert / mutate / delete to produce a non-trivial end state.
  3. Can be scaled in both ops and keys.

We use SplitMix64 for randomness. It is small, fast, has trivially portable arithmetic (just u64 adds, shifts, multiplies, and xors), and needs no library. The constants and step function are well-known:

state += 0x9E3779B97F4A7C15
z = state
z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9
z = (z ^ (z >> 27)) * 0x94D049BB133111EB
return z ^ (z >> 31)

Each workload iteration draws exactly three u64 words. Drawing the same number every iteration is what keeps the RNG stream identical across languages even when one branch is short and another is long.

r1, r2, r3 = rng.next(), rng.next(), rng.next()
kind = (r1 >> 62) & 0x3        # 0,1,2 → Incr ; 3 → Decr  (3:1 ratio)
k    = (r2 % keys) as i64
by   = (r3 % 100) + 1          # 1..=100

Three-to-one incr:decr means the counter map grows in expectation, but with keys small relative to ops the map fills up and the decrement path actually deletes entries — both code paths get exercised in any non-trivial run.
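The per-iteration decode, sketched in Rust. `Kind` and `decode` are illustrative names; the three words are assumed to have been drawn already, in order, before this function is called.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Kind {
    Incr,
    Decr,
}

/// Turns one (r1, r2, r3) triple into (kind, key, amount). `keys` must be non-zero.
pub fn decode(r1: u64, r2: u64, r3: u64, keys: u64) -> (Kind, i64, u64) {
    // Top two bits of r1: 0, 1, 2 -> Incr; 3 -> Decr.
    let kind = if (r1 >> 62) & 0x3 == 3 { Kind::Decr } else { Kind::Incr };
    // Modulus in u64 first, then a single cast to i64.
    let k = (r2 % keys) as i64;
    let by = (r3 % 100) + 1; // 1..=100
    (kind, k, by)
}
```

Doing the modulus on the unsigned value and casting afterwards avoids any dependence on how a language defines `%` for negative operands.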

Two frozen scenarios

| Scenario | seed | ops | keys | SHA-256 of snapshot |
| --- | --- | --- | --- | --- |
| A | 42 | 500 | 32 | 4b72eab6cbc773ac9584104c5923a5139b34ab466052bdb8ceacb087c06a9015 |
| B | 7 | 5000 | 256 | 5c35e7b1507834fda4960246640e6fb0b194b75b9593bec87159eafcbc3876a1 |

scripts/cross_test.sh builds all three binaries and asserts that the hashes match each other and these golden values.

Common cross-language divergence sources (and how we avoid them)

  • Map iteration order. We never iterate HashMap. We sort keys explicitly (Go) or use BTreeMap/std::map (Rust, C++).
  • Integer promotion in shifts. u64-only arithmetic. No mixed signed/unsigned shifts.
  • % semantics for negative operands. r2 is u64; modulus and cast to i64 happen exactly once and in the same order.
  • size_t width. We only put u32/u64 on the wire, never size_t directly.
  • Trailing whitespace / newlines in CLI output. hash prints the hex with no trailing newline. bench writes its line to stderr so it can never pollute stdout that a script might be hashing.

Bench methodology

benchctl bench runs a short warm-up (ops/10 + 1) to pull pages and populate caches, then a single timed pass over ops calls. It prints ops, keys, elapsed_us, ops_per_sec, and distinct (the number of live counters at the end) to stderr.

This is intentionally crude — the workload is a single thread doing in-memory map operations. It is good enough for "is the Rust build twice as fast as the Go build?" type questions; it is not a microbenchmark replacement for criterion / go test -bench / Google Benchmark. The references in references.md cover the deeper rabbit hole.
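The shape of such a harness, sketched in Rust with `std::time::Instant`. The names `bench`, `report`, and `BenchReport` are hypothetical, the workload is passed in as a closure that runs a given number of ops, and whether the warm-up shares state with the timed pass is left to that closure; none of this is taken from benchctl's source.

```rust
use std::time::Instant;

pub struct BenchReport {
    pub ops: u64,
    pub elapsed_us: u128,
    pub ops_per_sec: f64,
}

/// Runs an untimed warm-up of ops/10 + 1 calls, then one timed pass of `ops`.
pub fn bench<F: FnMut(u64)>(ops: u64, mut run: F) -> BenchReport {
    run(ops / 10 + 1); // warm-up: outside the timed region
    let start = Instant::now();
    run(ops);
    let elapsed = start.elapsed();
    let secs = elapsed.as_secs_f64();
    BenchReport {
        ops,
        elapsed_us: elapsed.as_micros(),
        // Guard against a zero reading from a very short pass.
        ops_per_sec: if secs > 0.0 { ops as f64 / secs } else { f64::INFINITY },
    }
}

/// Writes the result line to stderr so stdout stays clean for hashing.
pub fn report(r: &BenchReport) {
    eprintln!(
        "ops={} elapsed_us={} ops_per_sec={:.0}",
        r.ops, r.elapsed_us, r.ops_per_sec
    );
}
```

Keeping formatting and printing in `report`, after the clock has stopped, is one way to keep I/O and string allocation out of the timed region.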

What you actually learn from this lab

  • Why a benchmark needs to fix a deterministic workload before it fixes a metric.
  • Why "the same program in two languages" needs a binary equality test, not a "looks the same" code review.
  • Why bench harnesses must warm up, isolate stdout/stderr, and avoid hidden allocations inside the timed region.
  • Why HashMap iteration order is a footgun for portable wire formats.

References — db-22

Primary sources on benchmarking

  • Brendan Gregg. Systems Performance: Enterprise and the Cloud, 2nd ed., Addison-Wesley, 2020. The canonical modern reference. Chapter 12 ("Benchmarking") is required reading; the "active benchmarking" methodology and the catalog of common mistakes (cold-cache effects, the wrong saturation point, the wrong unit) frame the entire lab.

  • Brendan Gregg. BPF Performance Tools, Addison-Wesley, 2019. Less directly relevant here but the right book if you want to observe what your benchmark is actually doing on a Linux box.

  • Gil Tene. "How NOT to Measure Latency." Strange Loop 2015. The "coordinated omission" talk. Even on an in-memory benchmark like this one, the principle generalizes: the metric you report has to match the question the user is asking. We intentionally report ops_per_sec, not p99 latency, because a single-threaded synchronous loop does not have an interesting tail.

  • Bryant & O'Hallaron. Computer Systems: A Programmer's Perspective, 3rd ed., Pearson, 2015. Chapter 5 ("Optimizing Program Performance") and Chapter 9 ("Virtual Memory") supply the "always measure one level deeper" instinct used throughout the docs.

Determinism, RNGs, and reproducible benchmarks

Microbenchmarking pitfalls (per-language)

  • Andrey Akinshin. Pro .NET Benchmarking, Apress, 2019. Despite the .NET framing, chapters 1–4 are language-agnostic gold: warm-up, steady state, the dead-code-elimination trap, JIT vs AOT timing.

  • Aleksey Shipilëv. "JMH samples" and his "Nanotrusting the Nanotime" blog post. Java-specific but the lessons are universal — particularly the discussion of System.nanoTime resolution traps, which apply equally to std::chrono::steady_clock and Go's time.Now().

  • Rust: criterion documentation, especially the section on outlier detection.

  • Go: the testing package's Benchmark docs and Dave Cheney's "Five things that make Go fast".

  • C++: Google Benchmark and Chandler Carruth's CppCon talk "Tuning C++".

Cross-language byte-equality engineering

  • The Cap'n Proto encoding spec. A worked example of a wire format designed for cross-language stability. We do not use Cap'n Proto here, but its constraints (fixed-width little-endian, no sentinel ordering ambiguity, no implicit string normalization) are the same constraints we impose on dump_snapshot.

  • Go issue #7986: map iteration is intentionally randomized. Read the issue and the surrounding discussion; this is the canonical worked example of why a portable wire format must never iterate a hash map without an explicit sort.

Background reading on what "fast" means

Analysis — db-22

What we are actually trying to do

The brief was "performance and benchmarking." That is a topic, not a problem statement. The first design pass turned it into a problem statement:

Build a tiny system that has one knob (a deterministic workload) and one measurable property (throughput on that workload), then implement it in three languages and use byte-identical correctness as the precondition for any speed claim.

Everything else in the lab follows from that constraint.

Constraints I started with

  • Three languages must produce the same bytes for the same inputs. Without this, "Rust is faster than Go on this workload" is unfalsifiable — they might just be doing different work.
  • No external dependencies for the core data structure. SHA-256 has to be reimplementable from scratch in C++ (no OpenSSL), SplitMix64 has to be reimplemented in all three. This is the same discipline used in db-15 and is the only way to guarantee bit-identity.
  • The workload must be small enough to brute-force-test for determinism, but large enough that a 1% difference in implementation efficiency shows up in the bench numbers. The two scenarios (500 ops / 32 keys and 5000 ops / 256 keys) bracket this.
  • The bench harness must not contaminate the correctness harness. Throughput numbers go to stderr; the hex hash goes to stdout with no trailing newline. A shell script can $(benchctl hash ...) cleanly.

Data structure choice: counter store, not KV store

I considered an MVCC KV store (like db-15), a small B-tree, or even reusing db-20's distributed KV. All three are overkill for what we want to measure here. An i64 -> u64 counter store:

  • Is small enough to fit in ~400 LOC per language.
  • Is big enough to exercise the map implementations of each language (BTreeMap, map, std::map).
  • Has interesting cross-language failure modes (HashMap iteration order, signed/unsigned subtraction, integer-width truncation in serialization).
  • Has a workload that genuinely creates and destroys entries, so the map's resize / rebalance / erase code paths all execute.

The saturating-decrement decision

The choice about what decr(k, by) does when by > current or when k is missing is the most consequential semantic decision in the lab. I considered three options:

  1. No-op on missing, total_ops unchanged. Cleaner in some ways but makes total_ops a partial counter: it no longer equals the number of operations issued, so you cannot check it against ops without replaying the stream against the store. Rejected.
  2. Underflow / panic on negative. Would force the workload generator to remember which keys are live, defeating the determinism. Rejected.
  3. Saturating: bump total_ops, then either remove the entry or subtract. total_ops always tracks the operation stream. Snapshots are pure functions of the final state, not the operation history. This is what we picked.

The cost is that "decrement past zero" is silently lossy. For a benchmark workload that is the right trade.

Why three RNG draws per iteration

A subtle correctness footgun: if some branches consume fewer RNG words than others, the stream position after each op depends on which branch ran, and any implementation that draws lazily or evaluates the branches in a different order diverges from the rest. Drawing all three words before branching means every iteration consumes the same number of RNG words regardless of kind. This makes the workload trivially portable.

SplitMix64 over xoshiro / pcg / etc.

SplitMix64 has the smallest state (one u64) and the simplest step (one add, two multiplies, three xors, three shifts). Its only operations are 64-bit integer ops that all three languages handle identically with no surprises. Anything fancier is a footgun for cross-language byte-equality with no upside for a synthetic workload.

Wire format design notes

Little-endian everywhere. ASCII magic so a hexdump -C is human-readable. Length prefix (distinct_keys) so a reader could in principle parse the snapshot incrementally — we never actually do this in the lab, but the format is forward-compatible.

We do not embed keys or ops or seed in the snapshot. The snapshot is purely about the resulting state; the workload that produced it is metadata.

Bench harness design

Four decisions:

  1. One pass, one timing region. No statistical machinery. The exercises that need percentiles or distributions go to criterion / go test -bench / Google Benchmark — not this harness.
  2. Warm-up sized as ops/10 + 1. A small warm-up proportional to ops (+ 1 handles ops < 10) that pulls cache lines and triggers allocator first-touch. Empirically this stabilizes the timed pass to within a few percent run-to-run.
  3. Bench output to stderr. Lets benchctl bench and benchctl hash use the same flag layout and lets shell scripts redirect them independently.
  4. distinct in the output. It's a sanity check: if the bench reports distinct=0, your workload is collapsing entries faster than it creates them and your throughput number is measuring deletes, not inserts. (See observation.md for the actual numbers.)

What I'd do differently with more time

  • Add a third scenario with heavy churn on very few keys (small keys, large ops), so entries are created and erased repeatedly and the bench numbers become more sensitive to allocator behavior.
  • Wire the bench harness to also produce a flamegraph hint (elapsed_us bucketed per operation kind).
  • Add a --profile flag that runs the workload twice and reports the ratio, as a cheap "is this stable?" check.

Execution — db-22

How to run everything

# from db-22-performance-and-benchmarking/
bash scripts/verify.sh        # runs Rust, Go, and C++ unit tests
bash scripts/cross_test.sh    # builds 3 binaries, asserts byte-identical hashes

Both scripts end with === OK === or === ALL OK === respectively. They exit non-zero on any mismatch.

Per-language invocation

Rust

cd src/rust
cargo test --release --lib tests              # 9 tests
cargo build --release
./target/release/benchctl hash workload --seed 42 --ops 500 --keys 32 --scenario default
./target/release/benchctl bench workload --seed 1 --ops 100000 --keys 1024 --scenario default

The --release profile is important: the debug build is substantially slower because the optimizer is off, so the SplitMix64 step and the map calls are not inlined.

Go

cd src/go
go test ./...                                 # 9 tests
go build -o /tmp/benchctl_go ./cmd/benchctl
/tmp/benchctl_go hash workload --seed 42 --ops 500 --keys 32 --scenario default
/tmp/benchctl_go bench workload --seed 1 --ops 100000 --keys 1024 --scenario default

C++

cd src/cpp
mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build .
./test_db22                                   # 9 tests
./benchctl hash workload --seed 42 --ops 500 --keys 32 --scenario default
./benchctl bench workload --seed 1 --ops 100000 --keys 1024 --scenario default

The CMake file defaults to Release with -O3 -DNDEBUG. The test binary #undefs NDEBUG so its assertions are not stripped.

What the scripts do, step by step

scripts/verify.sh

  1. cd into the lab root.
  2. Run the Rust unit tests under cargo test --release --lib tests. --lib restricts the run to the library crate's unit tests, and tests is a name filter that matches the test module; without --lib cargo also runs the binary and doc-test targets, which typically each print a confusing "running 0 tests" block.
  3. Run the Go unit tests with go test ./....
  4. Configure and build the C++ project under src/cpp/build and run ./test_db22.
  5. Print === OK === on success.

scripts/cross_test.sh

  1. Build all three release binaries.
  2. For each of two frozen scenarios, run benchctl hash workload … in all three implementations, capture stdout (no trailing newline), and compare:
    • The three implementations must agree with each other.
    • They must agree with the golden hash committed in the script.
  3. Print === ALL OK === on success.

CLI shape

benchctl hash  workload --seed N --ops N --keys N --scenario S
benchctl bench workload --seed N --ops N --keys N --scenario S

  • hash prints the SHA-256 hex digest of the final snapshot, with no trailing newline, on stdout.
  • bench writes one line to stderr describing the run; its stdout is empty.

Both commands accept identical flags. --scenario is currently a documentation tag — it does not change behavior but is reserved for future workload variants.

Reproducing the frozen hashes

$ ./target/release/benchctl hash workload --seed 42 --ops 500 --keys 32 --scenario default
4b72eab6cbc773ac9584104c5923a5139b34ab466052bdb8ceacb087c06a9015

$ ./target/release/benchctl hash workload --seed 7 --ops 5000 --keys 256 --scenario default
5c35e7b1507834fda4960246640e6fb0b194b75b9593bec87159eafcbc3876a1

If you ever see a different hash:

  1. Did you change MAGIC, the wire format, the workload mixing rule, or SplitMix64? Any of those will move every hash.
  2. Did you change the decrement semantics? See analysis.md.
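  3. Are you iterating a HashMap or unordered_map instead of a sorted structure? Their iteration order is unspecified, so the hash can change from run to run (Rust's HashMap is randomly seeded per process) or differ between implementations.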
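
The third check is easy to illustrate. The following is a minimal sketch, not the lab's code: the helper name sorted_entries is hypothetical, and it only shows how a hash map can be turned into a stable sequence before it is serialized.

```rust
use std::collections::HashMap;

/// Returns the map's entries in ascending key order, so that a serializer
/// walking the result sees the same sequence on every run.
fn sorted_entries(map: &HashMap<i64, u64>) -> Vec<(i64, u64)> {
    let mut entries: Vec<(i64, u64)> = map.iter().map(|(k, v)| (*k, *v)).collect();
    // HashMap iteration order is unspecified; sorting removes that dependence.
    entries.sort_unstable_by_key(|&(k, _)| k);
    entries
}

fn main() {
    let mut map = HashMap::new();
    map.insert(7, 70);
    map.insert(-3, 30);
    map.insert(2, 20);
    for (k, v) in sorted_entries(&map) {
        println!("{k} -> {v}");
    }
}
```

Using an ordered map (BTreeMap) from the start avoids the extra sort; the sketch is only relevant when a hash map is already in place.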

Observation — db-22

Cross-language hash check

All three implementations agree on the bytes:

=== scenario A ===
rust: 4b72eab6cbc773ac9584104c5923a5139b34ab466052bdb8ceacb087c06a9015
go  : 4b72eab6cbc773ac9584104c5923a5139b34ab466052bdb8ceacb087c06a9015
cpp : 4b72eab6cbc773ac9584104c5923a5139b34ab466052bdb8ceacb087c06a9015
match + golden ok
=== scenario B ===
rust: 5c35e7b1507834fda4960246640e6fb0b194b75b9593bec87159eafcbc3876a1
go  : 5c35e7b1507834fda4960246640e6fb0b194b75b9593bec87159eafcbc3876a1
cpp : 5c35e7b1507834fda4960246640e6fb0b194b75b9593bec87159eafcbc3876a1
match + golden ok

Throughput probe (single representative run)

ops=100000 keys=1024 elapsed_us=7242 ops_per_sec=13806910 distinct=1024

About 13.8 million ops/sec for the Rust release build on a single thread, single core, no contention, on an Apple Silicon laptop. distinct=1024 tells us the map is fully populated at the end of the run — the increment-heavy mix means decrements rarely empty a slot at this keys cardinality.

Read this as: each op costs roughly 70 nanoseconds, of which a chunk is three SplitMix64 draws, a couple of map lookups, and the per-iteration loop overhead. It is in the right ballpark for an in-memory BTreeMap<i64, u64> workload.
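
The per-op figure follows directly from the ops and elapsed_us fields of the bench line. A small sketch of that arithmetic (the function names are illustrative, not part of benchctl):

```rust
/// Average cost of one operation in nanoseconds, given the `ops` and
/// `elapsed_us` fields of a bench line.
fn ns_per_op(ops: u64, elapsed_us: u64) -> f64 {
    (elapsed_us as f64 * 1_000.0) / ops as f64
}

/// Throughput in operations per second for the same two fields.
fn ops_per_sec(ops: u64, elapsed_us: u64) -> f64 {
    ops as f64 / (elapsed_us as f64 / 1_000_000.0)
}

fn main() {
    // Values taken from the representative run above.
    let (ops, elapsed_us) = (100_000, 7_242);
    println!("{:.1} ns/op", ns_per_op(ops, elapsed_us));
    println!("{:.0} ops/sec", ops_per_sec(ops, elapsed_us));
}
```

Recomputing throughput from the whole-microsecond elapsed_us value gives a number slightly different from the reported ops_per_sec, presumably because the harness divides by a finer-grained duration; that detail is an assumption, not something the output line states.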

What we are not measuring (and why that matters)

  • No allocator pressure beyond the initial map growth. The map reaches steady state after ~keys distinct entries are touched, and the rest of the run is in-place mutation.
  • No I/O, no syscalls, no real memory pressure. The whole working set fits in L2.
  • No latency distribution. We report a single throughput number. For a single-threaded synchronous loop, p99 latency would just be a rephrasing of throughput plus a small jitter from the OS scheduler.
  • No cross-language throughput numbers in this doc. You can collect them yourself with benchctl bench — but be honest about what you've measured (one machine, one moment, one workload).

Why the bench number is stable but not authoritative

The bench subcommand runs a small warm-up pass (ops/10 + 1) before the timed pass. On the order of 100k ops the warm-up is about 10k operations, which is enough to pull all the map slots and K256 SHA constants into the right caches. Without the warm-up the first pass is ~30% slower; with the warm-up, second-pass timings repeat to within a few percent run-to-run.

This is still a crude harness. We are not collecting CPU counters, we are not pinning to a CPU, we are not disabling turbo, we are not controlling for thermal state. Use these numbers for ordering ("did this change make it faster or slower?") and not for absolute claims ("Rust does N nanoseconds per op on this machine").

Sanity checks that fire if you break things

  • scenario_a_frozen / scenario_b_frozen — any change to wire format, mixing rule, or RNG step breaks both of these immediately.
  • splitmix64_known — guards against accidental constant-swap in the SplitMix64 mixing function.
  • sha256_vectors — guards against accidental damage to the SHA implementation in any language.
  • snapshot_layout_two_keys — pins the exact byte layout of a trivial 2-key snapshot, so a wire-format change shows a tightly localized failure (not just "scenario A differs").
  • workload_determinism — same seed/ops/keys gives the same bytes on two consecutive runs.

Verification — db-22

What "verified" means here

For a perf-and-bench lab, "verified" means three things at once:

  1. All three implementations pass their own unit tests (Rust 9, Go 9, C++ 9).
  2. All three implementations produce byte-identical snapshot hashes for both frozen scenarios.
  3. The frozen hashes match the golden values committed in source.

Anything less and the bench numbers are meaningless. You can't claim "Rust does X ops/sec on this workload" if it is not doing the same work as the Go and C++ versions.

How to verify

bash scripts/verify.sh
bash scripts/cross_test.sh

Each script exits non-zero on failure and prints either === OK === or === ALL OK === on success.

Expected last lines:

$ bash scripts/verify.sh
…
=== OK ===

$ bash scripts/cross_test.sh
…
=== ALL OK ===

What each unit test pins

| Test | Pins |
|---|---|
| sha256_vectors | SHA-256 against known empty and "abc" vectors |
| splitmix64_known | splitmix64(0) == 0x8b57dafca0cee644 |
| incr_accumulates | incr adds to existing entries, creates new ones, bumps total_ops |
| decr_saturates_and_removes | decrement past zero removes the entry |
| decr_on_missing_is_visible_op | decr on a missing key bumps total_ops but does not create the entry |
| snapshot_layout_two_keys | exact wire bytes of a 2-key snapshot |
| workload_determinism | same seed twice → same snapshot bytes |
| scenario_a_frozen / scenario_b_frozen | frozen golden hashes per scenario |

The frozen-scenario tests are the highest-value tests in the lab. Any silent change to the wire format, the workload, or SplitMix64 breaks both of them with a clear "got X, want Y" message in the failing language's test output.

Manual sanity checks

# bytes of the smallest meaningful snapshot
./target/release/benchctl hash workload --seed 0 --ops 0 --keys 1 --scenario default
# expected: sha256 of MAGIC || 0_u64 || 0_u32 = the empty-store hash

# determinism
./target/release/benchctl hash workload --seed 42 --ops 500 --keys 32 --scenario default
./target/release/benchctl hash workload --seed 42 --ops 500 --keys 32 --scenario default
# should print the same hex twice

What is not verified by these tests

  • That bench reports the correct throughput. It is impossible to verify a wall-clock number from a test. The bench harness has a distinct= field as a structural sanity check, but the numeric throughput is left to the operator to inspect.
  • That the implementations are equally fast — we only check they are equally correct. The whole point of the lab is to make speed comparisons honest by first making correctness identical.
  • That the implementations would still match on a 32-bit or big-endian platform. The wire format pins little-endian; on a hypothetical big-endian build we'd need a byte-swap in put_u64_le etc.

Broader Ideas — db-22

The lab as it stands is a deliberately minimal harness. These are extensions that would build naturally on top of it.

A. Percentile-aware bench harness

Replace the single-pass timer with a per-operation timing loop that collects a histogram (HDR-style) of per-op latencies. Then bench reports p50 / p90 / p99 / p99.9 in addition to throughput. This is where the Gil Tene "How NOT to Measure Latency" talk earns its keep — even on a synchronous single-thread loop, a long-tail GC pause in Go or a page fault in C++ will move the tail dramatically.

Trap to avoid: the cost of taking a timestamp per op (time.Now() / std::chrono::steady_clock::now()) is itself ~30 ns on most boxes, which is comparable to one workload op. You'd need to time batches of ops and divide.
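
One way to time batches and divide is sketched below. It is an illustration under assumptions: the function name time_in_batches, the batch size, and the stand-in operation are all hypothetical, and the percentile is taken over batch means rather than over individual ops.

```rust
use std::time::Instant;

/// Runs `total_ops` calls of `op` in batches of `batch` and returns the mean
/// nanoseconds per op observed in each batch. One pair of timestamps is taken
/// per batch, so the clock cost is amortized over `batch` operations.
fn time_in_batches<F: FnMut(u64)>(total_ops: u64, batch: u64, mut op: F) -> Vec<f64> {
    assert!(batch > 0, "batch size must be positive");
    let mut samples = Vec::new();
    let mut done = 0u64;
    while done < total_ops {
        let n = batch.min(total_ops - done);
        let start = Instant::now();
        for i in done..done + n {
            op(i);
        }
        let elapsed = start.elapsed();
        samples.push(elapsed.as_nanos() as f64 / n as f64);
        done += n;
    }
    samples
}

fn main() {
    // Stand-in operation; a real harness would call the workload step here.
    let mut acc = 0u64;
    let mut samples = time_in_batches(100_000, 1_000, |i| acc = acc.wrapping_add(i * i));
    samples.sort_by(|a, b| a.total_cmp(b));
    let p99 = samples[(samples.len() - 1) * 99 / 100];
    eprintln!("batches={} p99_ns_per_op={:.1} checksum={}", samples.len(), p99, acc);
}
```

The trade-off is resolution: a single slow op is averaged into its batch, so smaller batches expose more of the tail at the price of more clock overhead.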

B. Allocator pressure scenario

Add a third scenario whose workload is deliberately allocator-heavy: short-lived strings as values (move from u64 to String), or a churn pattern that constantly creates and removes keys so the map is forced to resize. The cross-language throughput delta for this scenario would be much larger than for the existing one, and the results would speak to the maturity of each language's allocator.

C. Multi-threaded variant

Wrap CounterStore in a sync primitive and run N workers. The point is not to demonstrate scaling (Mutex<BTreeMap<…>> won't scale) but to demonstrate the difference between coarse locking, sharded locking, and lock-free updates. Each language has different idioms here (parking_lot vs std::sync, sync.Map vs atomic, std::shared_mutex vs std::atomic), and the cross-language comparison becomes a language-features comparison.

D. Snapshot replay / log shipping

Right now dump_snapshot produces bytes that are only used for hashing. Add a restore_snapshot and a small "log" of operations (just the sequence of (op, k, by) triples), and you have a tiny replicated store. Connect three nodes via a deterministic schedule and you have a toy version of db-23.

E. Energy and non-time metrics

On Apple Silicon, powermetrics --samplers cpu_power can give you energy per op. The relative energy of the Rust / Go / C++ implementations on the same workload is a more honest "which is more efficient" claim than throughput, because it folds in stalls, branch mispredictions, and memory bandwidth.

F. Comparison with off-the-shelf benchmark frameworks

Run the same workload under criterion (Rust), go test -bench, and Google Benchmark (C++). Compare:

  • Their reported throughput vs ours.
  • Their reported variance.
  • The shape of their output.

The lab's homegrown harness will look crude in comparison, and that's the point — the exercise of measuring the difference is more educational than the difference itself.

G. Worst-case scenario discovery

Use coverage-guided fuzzing on the workload generator (with the saturating-decrement invariant as the asserted property) to find a seed/ops/keys combination that maximizes either throughput or memory pressure. This connects perf work to the fuzz/property-test discipline used in db-13 and db-15.

H. Cross-architecture verification

Run the existing scripts/cross_test.sh under qemu-user-static for aarch64 / x86_64 / riscv64 and confirm the hashes still match. They should — the wire format is little-endian and the arithmetic is all 64-bit — but the only way to be sure is to actually do it.

I. Cache-aware redesign of CounterStore

std::map and BTreeMap are node-based trees, and the Go version has to sort its map keys before it can iterate in order. A flat sorted array with binary search would be slower for insert but dramatically faster for the iteration step (which is the critical path in dump_snapshot). For a workload that touches each key only a handful of times before snapshotting, the array would be worth measuring.

J. The "ten percent rule"

A small operational rule we picked up doing this lab: any perf change worth claiming must move the bench number by more than ten percent. Below that, run-to-run noise on a laptop dominates. Above that, you can usually attribute the change to a specific code path. The harness is deliberately not precise enough to defend a 2% claim, and that's a feature.
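
The rule is simple enough to encode as a check. A minimal sketch, with hypothetical function names and the threshold taken from the paragraph above:

```rust
/// Relative change of `after` versus `before`, as a fraction (0.10 == 10%).
fn relative_change(before_ops_per_sec: f64, after_ops_per_sec: f64) -> f64 {
    (after_ops_per_sec - before_ops_per_sec) / before_ops_per_sec
}

/// True when the change clears the ten percent threshold in either direction.
fn worth_claiming(before_ops_per_sec: f64, after_ops_per_sec: f64) -> bool {
    relative_change(before_ops_per_sec, after_ops_per_sec).abs() > 0.10
}

fn main() {
    let before = 13.8e6;
    for after in [14.0e6, 15.5e6, 12.0e6] {
        println!(
            "{:.1}M -> {:.1}M: {:+.1}% claimable={}",
            before / 1e6,
            after / 1e6,
            relative_change(before, after) * 100.0,
            worth_claiming(before, after)
        );
    }
}
```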

Step 01 — Counter Store

Goal

Implement a CounterStore in each of three languages with byte-identical semantics for incr, decr, and get. The data structure is intentionally small — three operations, two pieces of state — so we can focus on the edge cases that make cross-language byte-identity hard.

What to build

A type/struct/class CounterStore with:

  • An ordered map i64 -> u64 (BTreeMap, sorted-keys map, std::map).
  • A u64 running counter total_ops.
  • incr(k, by): total_ops += 1; add by to (or create) counters[k].
  • decr(k, by): total_ops += 1; if k is missing, stop. Otherwise remove the entry if by >= current, else subtract.
  • get(k) -> Option<u64> / (u64, bool) / std::optional<u64>.
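
A minimal Rust sketch of the type described above. It follows the bullet list literally; overflow behavior of incr is not specified by the text, so the plain += here is an assumption.

```rust
use std::collections::BTreeMap;

/// Ordered counter map plus a running count of every operation attempted.
#[derive(Default)]
struct CounterStore {
    counters: BTreeMap<i64, u64>,
    total_ops: u64,
}

impl CounterStore {
    /// Adds `by` to the counter for `k`, creating the entry if it is absent.
    fn incr(&mut self, k: i64, by: u64) {
        self.total_ops += 1;
        *self.counters.entry(k).or_insert(0) += by;
    }

    /// Counts the op, then removes the entry when `by >= current`,
    /// otherwise subtracts. A missing key is left missing.
    fn decr(&mut self, k: i64, by: u64) {
        self.total_ops += 1;
        let current = match self.counters.get(&k) {
            Some(&c) => c,
            None => return, // visible op, but no entry is created
        };
        if by >= current {
            self.counters.remove(&k);
        } else {
            // Safe: by < current, so the subtraction cannot underflow.
            self.counters.insert(k, current - by);
        }
    }

    /// Read-only lookup; never inserts.
    fn get(&self, k: i64) -> Option<u64> {
        self.counters.get(&k).copied()
    }
}

fn main() {
    let mut store = CounterStore::default();
    store.incr(1, 5);
    store.decr(1, 3);
    store.decr(42, 1); // missing key: counted, but no entry appears
    println!(
        "get(1)={:?} get(42)={:?} total_ops={}",
        store.get(1),
        store.get(42),
        store.total_ops
    );
}
```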

Tests this step should pass

  • incr_accumulates: three incrs across two keys leave the right per-key values and total_ops == 3.
  • decr_saturates_and_removes: incr(1, 5); decr(1, 3); decr(1, 100) leaves the map empty with total_ops == 3.
  • decr_on_missing_is_visible_op: decr(42, 1) on an empty store leaves total_ops == 1 and no entry for 42.

Things to watch for

  • u64 underflow: never compute current - by without the current <= by check first.
  • Go's map: a missing key reads back as the zero value with ok=false. Use the comma-ok form explicitly.
  • C++ std::map::operator[]: avoid it on the read path — it inserts a zero entry as a side effect. Use find.

Acceptance

cargo test --release --lib tests::incr_accumulates and the matching Go / C++ tests all pass.

Step 02 — Snapshot and Workload

Goal

Pin a wire format for CounterStore and a deterministic workload generator so that, given identical (seed, ops, keys), all three implementations produce the same bytes — and therefore the same SHA-256 digest.

What to build

dump_snapshot

A byte serializer with this exact layout:

"DSEBENCH"  (8 bytes, ASCII)
total_ops   (u64 little-endian)
distinct_keys (u32 little-endian)
for each key in ascending order:
    key (i64 little-endian)
    count (u64 little-endian)

Critical details:

  • Ascending iteration order. BTreeMap / std::map are already sorted; Go must call sort.Slice on the keys explicitly.
  • Little-endian for every integer.
  • No padding, no separators, no trailing bytes.
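
The layout can be sketched in Rust as follows. The function signature is an assumption made to keep the example self-contained (it takes the two pieces of state directly instead of a CounterStore), and panicking on more than u32::MAX keys is a choice of this sketch, not something the lab specifies.

```rust
use std::collections::BTreeMap;

/// Magic prefix from the layout above.
const MAGIC: &[u8; 8] = b"DSEBENCH";

/// Serializes the store state into the pinned wire format.
fn dump_snapshot(total_ops: u64, counters: &BTreeMap<i64, u64>) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + 8 + 4 + counters.len() * 16);
    out.extend_from_slice(MAGIC);
    out.extend_from_slice(&total_ops.to_le_bytes());
    // distinct_keys is a u32 on the wire; refuse to truncate silently.
    assert!(counters.len() <= u32::MAX as usize, "too many keys for a u32 count");
    out.extend_from_slice(&(counters.len() as u32).to_le_bytes());
    // BTreeMap iterates in ascending key order, which the format requires.
    for (key, count) in counters {
        out.extend_from_slice(&key.to_le_bytes());
        out.extend_from_slice(&count.to_le_bytes());
    }
    out
}

fn main() {
    let mut counters = BTreeMap::new();
    counters.insert(2, 1);
    counters.insert(1, 1);
    let bytes = dump_snapshot(2, &counters);
    println!("snapshot is {} bytes", bytes.len());
}
```

to_le_bytes produces little-endian bytes regardless of the host's byte order, so no conditional byte-swapping is needed in this version.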

SplitMix64

Implement the standard one-state-word SplitMix64:

state += 0x9E3779B97F4A7C15
z = state
z = (z ^ (z >> 30)) * 0xBF58476D1CE4E7B5
z = (z ^ (z >> 27)) * 0x94D049BB133111EB
return z ^ (z >> 31)

Also implement the stateless splitmix64(x) (without the state += step) for the canonical test vector check.
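
A Rust sketch of the stateful generator, transcribed from the pseudocode above. The type and method names are illustrative; the stateless helper and its test vector are not reproduced here.

```rust
/// One-state-word SplitMix64 generator.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        SplitMix64 { state: seed }
    }

    /// Advances the state and returns the next 64-bit word.
    /// The algorithm is defined modulo 2^64, so every add and multiply
    /// uses the wrapping_* form to avoid overflow panics in debug builds.
    fn next(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E7B5);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

fn main() {
    let mut rng = SplitMix64::new(42);
    for _ in 0..3 {
        println!("{:016x}", rng.next());
    }
}
```

In Go the same arithmetic wraps by default on uint64, and in C++ unsigned overflow is well defined, so only the Rust version needs the explicit wrapping calls.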

run_workload(seed, ops, keys, scenario)

rng = SplitMix64(seed)
store = empty CounterStore
repeat ops times:
    r1 = rng.next()
    r2 = rng.next()
    r3 = rng.next()
    kind = (r1 >> 62) & 0x3       # 0,1,2 → incr, 3 → decr
    k    = i64(r2 % keys)
    by   = (r3 % 100) + 1
    if kind == 3 -> store.decr(k, by) else store.incr(k, by)
return store.dump_snapshot()

The scenario argument is reserved and ignored for now.
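
The mapping from three RNG words to one operation can be sketched in isolation. The names Kind and decode_op are hypothetical; the arithmetic is taken from the pseudocode above.

```rust
#[derive(Debug, PartialEq)]
enum Kind {
    Incr,
    Decr,
}

/// Maps three already-drawn RNG words to (kind, key, amount).
fn decode_op(r1: u64, r2: u64, r3: u64, keys: u64) -> (Kind, i64, u64) {
    assert!(keys > 0, "keys must be non-zero to avoid division by zero");
    // Top two bits of r1: values 0, 1, 2 mean incr and 3 means decr,
    // so uniformly random input gives three increments for every decrement.
    let kind = if (r1 >> 62) & 0x3 == 3 { Kind::Decr } else { Kind::Incr };
    let k = (r2 % keys) as i64; // always in 0..keys
    let by = (r3 % 100) + 1; // always in 1..=100
    (kind, k, by)
}

fn main() {
    println!("{:?}", decode_op(3 << 62, 5, 199, 4));
    println!("{:?}", decode_op(0, 1023, 100, 1024));
}
```

Because the function takes all three words as arguments, the caller has to draw them unconditionally, which keeps the RNG stream independent of the branch taken.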

Tests this step should pass

  • sha256_vectors: empty and "abc" SHA-256 vectors.
  • splitmix64_known: splitmix64(0) == 0x8b57dafca0cee644.
  • snapshot_layout_two_keys: incr keys 2 and 1, snapshot is 52 bytes with magic, total_ops=2, distinct_keys=2, then the row for key 1 before the row for key 2.
  • workload_determinism: two runs of the same workload produce byte-identical snapshots.
  • scenario_a_frozen / scenario_b_frozen: hashes match the golden values in CONCEPTS.md.

Things to watch for

  • Always draw three RNG words per iteration, even if a branch only needs two. The RNG stream must be identical across languages.
  • Never iterate a hash map for serialization. Sort first.
  • Don't put size_t or usize on the wire — always serialize as u32 or u64.

Acceptance

scripts/cross_test.sh reports === ALL OK ===.

Step 03 — Bench Harness

Goal

Add a bench subcommand to benchctl in each language that runs the same workload as the hash subcommand and reports a throughput number. The harness should be small enough to read end-to-end but disciplined enough not to lie.

What to build

A bench workload --seed N --ops N --keys N --scenario S subcommand that:

  1. Runs a warm-up pass of ops/10 + 1 operations and discards the result.
  2. Captures a high-resolution start timestamp.
  3. Runs the full ops workload and keeps the resulting CounterStore so we can read distinct from it.
  4. Captures a high-resolution end timestamp.
  5. Writes one line to stderr in this format:
     ops=<N> keys=<N> elapsed_us=<N> ops_per_sec=<N> distinct=<N>
  6. Writes nothing to stdout.

The CLI's hash subcommand must remain unchanged: stdout-only, no trailing newline, no diagnostic noise.

Timing primitives by language

  • Rust: std::time::Instant.
  • Go: time.Now() / time.Since().
  • C++: std::chrono::steady_clock.

steady_clock / Instant are the right choice — they are monotonic and not subject to wall-clock adjustments mid-run.
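
The shape of the harness can be sketched with std::time::Instant. This is an outline under assumptions: placeholder_workload stands in for the real workload, the helper names are hypothetical, and reporting 0 ops/sec for a zero-microsecond run is a choice made here to avoid dividing by zero.

```rust
use std::collections::BTreeSet;
use std::time::Instant;

/// Warm-up length used before the timed pass.
fn warmup_ops(ops: u64) -> u64 {
    ops / 10 + 1
}

/// Formats the single stderr line.
fn bench_line(ops: u64, keys: u64, elapsed_us: u128, distinct: usize) -> String {
    let ops_per_sec = if elapsed_us == 0 { 0 } else { ops as u128 * 1_000_000 / elapsed_us };
    format!("ops={ops} keys={keys} elapsed_us={elapsed_us} ops_per_sec={ops_per_sec} distinct={distinct}")
}

/// Stand-in for the real workload; returns the number of distinct keys touched.
fn placeholder_workload(ops: u64, keys: u64) -> usize {
    let mut seen = BTreeSet::new();
    for i in 0..ops {
        seen.insert(i % keys);
    }
    seen.len()
}

fn main() {
    let (ops, keys) = (100_000u64, 1_024u64);
    // Warm-up pass: result discarded, nothing printed.
    let _ = placeholder_workload(warmup_ops(ops), keys);
    // Timed pass: no formatting or I/O between the two timestamps.
    let start = Instant::now();
    let distinct = placeholder_workload(ops, keys);
    let elapsed_us = start.elapsed().as_micros();
    // Report on stderr only; stdout stays empty.
    eprintln!("{}", bench_line(ops, keys, elapsed_us, distinct));
}
```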

Tests this step should pass

There are no automated tests for bench (timing values can't be asserted), but the structural sanity check is:

./target/release/benchctl bench workload --seed 1 --ops 100000 --keys 1024 --scenario default
# expect on stderr:
# ops=100000 keys=1024 elapsed_us=<some number> ops_per_sec=<some number> distinct=1024
# expect on stdout: nothing

Things to watch for

  • Don't put printf inside the timed region. Allocating a string is ~hundreds of nanoseconds and will dominate small workloads.
  • Don't take a timestamp per op. The cost of Now() is comparable to the cost of one workload op.
  • Don't forget the warm-up. The first pass is dominated by cold-cache effects and first-touch allocator behavior.
  • Don't claim numbers across machines without describing the machine.

Acceptance

Running bench against a 100k-op, 1024-key workload produces a throughput line on stderr and an empty stdout. verify.sh and cross_test.sh continue to pass.

db-23 — Capstone: distributed replicated KV database

This is the final lab. It synthesizes everything from db-01 through db-22 into a single tiny but real distributed key/value database whose state is byte-identical across Rust, Go, and C++ for two frozen scenarios.

What this lab builds

A 3-node replicated KV cluster:

| Node | Role |
|---|---|
| 0 | Leader. The only node that originates writes. |
| 1 | Follower. Can be taken down mid-run. |
| 2 | Follower. Always up. |

Each write Op (Put or Del) is:

  1. Drawn deterministically from a SplitMix64 stream (see db-04, db-22).
  2. Appended to the leader's log at index log.len() + 1.
  3. Replicated synchronously to every live follower.
  4. Counted as ack'd by every live node whose log already contains that index (plus the leader itself).
  5. Committed on every reachable node when the ack count reaches a majority of 3 (= 2).
  6. Applied: each newly-committed entry mutates the local BTreeMap<i64, i64> state machine in commit-index order.

A catch_up operation lets a recovering follower copy any missing log entries from the leader and advance its commit/apply watermark.
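
The write path and catch_up can be sketched as a small in-process cluster. This is an illustration, not the lab's implementation: the type layout, the bool return of submit, and the omission of terms are all simplifying assumptions.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy)]
enum Op {
    Put(i64, i64),
    Del(i64),
}

#[derive(Default)]
struct Node {
    up: bool,
    log: Vec<Op>, // the entry with 1-based index i is stored at log[i - 1]
    commit_index: usize,
    last_applied: usize,
    kv: BTreeMap<i64, i64>,
}

impl Node {
    /// Applies committed-but-unapplied entries in index order.
    fn apply_committed(&mut self) {
        while self.last_applied < self.commit_index {
            match self.log[self.last_applied] {
                Op::Put(k, v) => {
                    self.kv.insert(k, v);
                }
                Op::Del(k) => {
                    self.kv.remove(&k);
                }
            }
            self.last_applied += 1;
        }
    }
}

/// Three nodes; node 0 is the permanent leader.
struct Cluster {
    nodes: Vec<Node>,
}

impl Cluster {
    fn new() -> Self {
        let nodes = (0..3).map(|_| Node { up: true, ..Node::default() }).collect();
        Cluster { nodes }
    }

    /// Append, replicate, count acks, commit, apply. Returns true on commit.
    fn submit(&mut self, op: Op) -> bool {
        self.nodes[0].log.push(op);
        let index = self.nodes[0].log.len();
        let mut acks = 1; // the leader counts itself
        for f in self.nodes.iter_mut().skip(1).filter(|n| n.up) {
            // A follower acks only when the entry lands at the same index.
            if f.log.len() + 1 == index {
                f.log.push(op);
                acks += 1;
            }
        }
        if acks < 2 {
            return false; // no majority of 3
        }
        for n in self.nodes.iter_mut().filter(|n| n.up && n.log.len() >= index) {
            n.commit_index = index;
            n.apply_committed();
        }
        true
    }

    /// Copies missing entries from the leader, then advances commit and apply.
    fn catch_up(&mut self, id: usize) {
        assert!(id == 1 || id == 2, "only followers catch up");
        let (head, tail) = self.nodes.split_at_mut(1);
        let (leader, follower) = (&head[0], &mut tail[id - 1]);
        let start = follower.log.len();
        follower.log.extend_from_slice(&leader.log[start..]);
        follower.commit_index = leader.commit_index;
        follower.apply_committed();
    }
}

fn main() {
    let mut c = Cluster::new();
    c.submit(Op::Put(1, 100));
    c.nodes[1].up = false; // follower 1 goes down
    c.submit(Op::Put(2, 200)); // still commits: leader + follower 2
    c.submit(Op::Del(1));
    c.nodes[1].up = true;
    c.catch_up(1);
    for (id, n) in c.nodes.iter().enumerate() {
        println!("node {id}: log_len={} commit={} kv={:?}", n.log.len(), n.commit_index, n.kv);
    }
}
```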

Two scenarios — frozen hashes

The cluster snapshot is the canonical encoding of all three nodes concatenated. We hash it with SHA-256.

| Scenario | seed | ops | keys | fault? | SHA-256 |
|---|---|---|---|---|---|
| normal | 42 | 200 | 16 | no | 5976b45b9f40f440e8249da27fe4fe752e005f606efc3596bdb25ca4e4f99296 |
| fault | 7 | 2000 | 128 | follower 1 down on [ops/2, 3·ops/4) | d67c36725af65242e985a308db5152af2a3e2525fab33d11ed6e826a252ff792 |

Both hashes are frozen as constants in src/rust/src/lib.rs, src/go/db23_test.go, and src/cpp/src/db23.h, and cross-checked by scripts/cross_test.sh.

Deterministic workload

For op i the RNG draws three u64s regardless of branch outcome, so the stream is identical no matter which kind of op gets generated:

r1, r2, r3 = rng.next(), rng.next(), rng.next()
kind       = (r1 >> 62) & 0x3   // 0,1,2 -> Put,  3 -> Del
k          = i64(r2 % keys)
v          = i64(r3 % 1000)

The fault schedule is purely a function of ops:

down_start = ops / 2
down_end   = (ops * 3) / 4

At i == down_start follower 1 is marked down; at i == down_end it comes back up and we immediately catch_up. If the loop happens to end while follower 1 is still down, we catch it up once more at the end so all three nodes always converge.
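
The schedule is small enough to sketch directly. The function names are hypothetical; the arithmetic is the one given above, and i is assumed to be a zero-based op index.

```rust
/// Half-open window [down_start, down_end) during which follower 1 is down.
/// Note: ops * 3 would overflow for ops above u64::MAX / 3; the frozen
/// scenarios are nowhere near that.
fn fault_window(ops: u64) -> (u64, u64) {
    (ops / 2, (ops * 3) / 4)
}

/// True when op `i` is submitted while follower 1 is down.
fn follower1_down(i: u64, ops: u64) -> bool {
    let (down_start, down_end) = fault_window(ops);
    down_start <= i && i < down_end
}

fn main() {
    let ops = 2_000;
    let (down_start, down_end) = fault_window(ops);
    let missed = (0..ops).filter(|&i| follower1_down(i, ops)).count();
    println!("down on [{down_start}, {down_end}), {missed} entries to catch up");
}
```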

Per-node canonical encoding

magic           : 8 bytes  = "DSEDIST2"
node_id         : u8
term            : u64 LE
commit_index    : u64 LE
log_len         : u32 LE
log[log_len] of:
    term        : u64 LE
    index       : u64 LE
    op_kind     : u8         (0 = Put, 1 = Del)
    key         : i64 LE
    value       : i64 LE     (0 for Del)
kv_len          : u32 LE
kv[kv_len] of (ascending by key):
    key         : i64 LE
    value       : i64 LE

The cluster snapshot is just node0.encode() || node1.encode() || node2.encode(). last_applied is not serialized — after a write loop completes (with terminal catch-up) it always equals commit_index, so it carries no extra information.
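
A sketch of the per-node encoder in Rust. The struct shapes are assumptions made for the example; only the byte layout comes from the format above. Lengths are cast to u32 without a range check, which a real implementation may want to guard.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy)]
enum Op {
    Put { key: i64, value: i64 },
    Del { key: i64 },
}

struct LogEntry {
    term: u64,
    index: u64,
    op: Op,
}

struct Node {
    node_id: u8,
    term: u64,
    commit_index: u64,
    log: Vec<LogEntry>,
    kv: BTreeMap<i64, i64>,
}

/// Encodes one node in the canonical layout; all integers little-endian.
fn encode_node(n: &Node) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(b"DSEDIST2");
    out.push(n.node_id);
    out.extend_from_slice(&n.term.to_le_bytes());
    out.extend_from_slice(&n.commit_index.to_le_bytes());
    out.extend_from_slice(&(n.log.len() as u32).to_le_bytes());
    for e in &n.log {
        out.extend_from_slice(&e.term.to_le_bytes());
        out.extend_from_slice(&e.index.to_le_bytes());
        let (kind, key, value) = match e.op {
            Op::Put { key, value } => (0u8, key, value),
            Op::Del { key } => (1u8, key, 0), // Del carries value 0 on the wire
        };
        out.push(kind);
        out.extend_from_slice(&key.to_le_bytes());
        out.extend_from_slice(&value.to_le_bytes());
    }
    out.extend_from_slice(&(n.kv.len() as u32).to_le_bytes());
    // BTreeMap iteration is ascending by key, as the format requires.
    for (k, v) in &n.kv {
        out.extend_from_slice(&k.to_le_bytes());
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

fn main() {
    let mut kv = BTreeMap::new();
    kv.insert(1, 100);
    let node = Node {
        node_id: 0,
        term: 1, // illustrative term value
        commit_index: 1,
        log: vec![LogEntry { term: 1, index: 1, op: Op::Put { key: 1, value: 100 } }],
        kv,
    };
    println!("{} bytes", encode_node(&node).len());
    let del = Op::Del { key: 1 };
    let _ = del; // Del is exercised by the encoder's match arm
}
```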

Sources of cross-language divergence — avoided

| Risk | How we eliminate it |
|---|---|
| Map iteration order | Sort i64 keys ascending in Go (sort.Slice); BTreeMap/std::map already ordered in Rust/C++. |
| Endianness | All multi-byte ints written little-endian by hand. |
| RNG branch-skew | Always draw 3 words per op regardless of kind. |
| 32/64-bit int | All wire types are u8/u32/u64/i64; sizes are explicit. |
| Apply order under fault | Apply is gated by a single monotonic commit-index counter, and catch_up is called at well-defined points. |
| 0 value for Del | C++/Go fill v=0; Rust matches with explicit value() returning 0 for Del. |

What this synthesizes from prior labs

| Earlier lab(s) | Used here as |
|---|---|
| db-01 storage primitives | Manual byte-level LE encoding. |
| db-02 data structures | Sorted map state machine. |
| db-03 write-ahead log | The per-node log is the WAL. |
| db-04 hashing | SHA-256 + SplitMix64 PRNG. |
| db-05/06/07/08 LSM stages | Replaced here by a simpler in-memory state machine, but the apply-log-then-mutate pattern is the same. |
| db-13 transactions | Atomic apply per committed entry (no partial state). |
| db-16 distributed fundamentals | Replication, majority quorum, follower catch-up. |
| db-17 raft | Leader-only writes, log indexing, commit watermark. |
| db-22 perf & bench | Deterministic workload + canonical snapshot pattern. |

How to verify

bash scripts/verify.sh      # runs all 9 tests in all 3 languages
bash scripts/cross_test.sh  # confirms cross-lang + golden equality

They must end with === OK === and === ALL OK === respectively.

References — db-23 capstone

Replication and consensus

  • Diego Ongaro & John Ousterhout. In Search of an Understandable Consensus Algorithm (Extended Version). ATC 2014. The Raft paper — the leader/log/commit-index model used by this lab is a direct simplification of it.
  • Leslie Lamport. Paxos Made Simple. 2001. An accessible restatement of Paxos, the original majority-quorum consensus algorithm, first published as The Part-Time Parliament (1998).
  • Flavio Junqueira, Benjamin Reed, Marco Serafini. ZAB: High-performance broadcast for primary-backup systems. DSN 2011. Used by ZooKeeper; closest in spirit to the leader-only single-quorum model here.

Theory

  • Fischer, Lynch, Paterson. Impossibility of Distributed Consensus with One Faulty Process. JACM 1985. Why deterministic consensus needs failure detectors / partial synchrony.
  • Eric Brewer. Towards Robust Distributed Systems. PODC 2000 keynote (CAP conjecture). Gilbert & Lynch later proved it.
  • Seth Gilbert & Nancy Lynch. Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. SIGACT 2002.

Practitioner material

  • MIT 6.824 Distributed Systems lectures (esp. lectures 5–8 on Raft).
  • Martin Kleppmann. Designing Data-Intensive Applications. O'Reilly 2017. Chs. 5, 8, 9 on replication, consistency, and consensus.
  • Kyle Kingsbury. Jepsen reports (https://jepsen.io). Practical examples of how real systems violate the guarantees their READMEs claim.

Isolation testing

  • Martin Kleppmann. Hermitage: concrete tests that expose what isolation levels really mean (https://github.com/ept/hermitage).

What this lab does not model

  • Leader election (we hardcode node 0 as leader forever).
  • Log truncation / divergent suffixes (we use synchronous in-process replication, so followers never have entries the leader lacks).
  • Membership changes, log compaction, snapshots, network partitions beyond a single follower being marked down.

Those are the natural follow-on projects after this capstone — see docs/broader-ideas.md.

Analysis — db-23 capstone

Goal restated

Build the smallest possible thing that is honestly a replicated KV database, port it to three languages with byte-identical state, and prove convergence under a deterministic failure scenario.

Design choices

Why synchronous in-process replication?

A real Raft cluster uses goroutines/threads, network RPC, election timeouts, and randomized jitter — all of which are sources of nondeterminism. For a capstone whose entire point is cross-language byte-equality, that would defeat itself.

So instead the "network" is a function call. The Cluster's submit runs synchronously: it appends on the leader, appends on every live follower, and commits once a quorum is reached. This behaves like a Raft cluster with a fixed leader running in lockstep, with no message loss or reordering.

Why majority = 2?

3 nodes, so a quorum is 2. The leader counts itself. As long as either follower is up, the cluster commits. When follower 1 is marked down, follower 2 + leader still form a quorum. If both followers were down simultaneously, writes would block — but our fault schedule never does that, so submit never wedges.
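
The quorum arithmetic generalizes beyond three nodes. A minimal sketch, with hypothetical function names:

```rust
/// Smallest number of nodes that forms a strict majority of the cluster.
fn majority(cluster_size: usize) -> usize {
    cluster_size / 2 + 1
}

/// True when the leader plus `live_followers` can commit a write.
fn can_commit(live_followers: usize, cluster_size: usize) -> bool {
    1 + live_followers >= majority(cluster_size)
}

fn main() {
    for live in 0..=2 {
        println!("3 nodes, {live} live followers: commit={}", can_commit(live, 3));
    }
}
```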

Why a single deterministic leader?

Leader election adds randomness (timeouts) and protocol surface (terms, RequestVote). We pin node 0 as the perpetual leader. The lab still shows the replication half of Raft faithfully; election is left as a follow-on (see broader-ideas.md).

Why three RNG draws per op, including for Del?

If we drew fewer words on Del branches, the RNG stream would advance differently for runs that happen to produce more Dels, and frozen hashes would depend on the kind distribution. By always consuming exactly three words we ensure the stream depends only on seed and ops, not on what kinds happened.

Why drop last_applied from the wire format?

After the final catch_up (which runs unconditionally if follower 1 ended down), every node satisfies last_applied == commit_index. Including it in the encoding would waste bytes and risk a Rust/Go/C++ divergence if one of them computed it slightly differently mid-run. It is a derived quantity, so we omit it.

Failure model

The only fault we inject is a single follower going down for one quarter of the run:

[0, ops/2)              all three nodes replicate
[ops/2, 3·ops/4)        follower 1 down; quorum is {0, 2}
[3·ops/4, ops)          follower 1 up + caught up; all three replicate
end                     final catch_up if still mid-down (handles ops%4)

This produces a clean, hashable post-condition: every node has the same log, the same commit_index, and the same kv map.

Why two scenarios?

  • normal (no fault) shows the happy path and stresses the commit path under a small workload.
  • fault (with the follower window) stresses replication under partial availability and the catch-up code path. With 2000 ops the fault window [1000, 1500) spans 500 ops, so the recovering follower has 500 entries to copy and apply.

Both must produce the same hash in all three languages.

Execution — db-23 capstone

Build & test

# everything
bash scripts/verify.sh

# cross-language identity check
bash scripts/cross_test.sh

verify.sh runs the 9-test suite in each of Rust, Go, and C++ and ends with === OK ===. cross_test.sh builds three dbctl binaries, runs both scenarios in each language, asserts equality across the three languages, and asserts each matches the frozen golden hash, then ends with === ALL OK ===.

Per-language one-liners

# Rust
( cd src/rust && cargo test --release --lib tests )
( cd src/rust && cargo run --release --bin dbctl -- \
    hash workload --seed 42 --ops 200 --keys 16 --scenario normal; echo )

# Go
( cd src/go && go test ./... )
( cd src/go && go run ./cmd/dbctl \
    hash workload --seed 42 --ops 200 --keys 16 --scenario normal; echo )

# C++
( cd src/cpp && cmake -S . -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j )
src/cpp/build/test_db23
src/cpp/build/dbctl hash workload --seed 42 --ops 200 --keys 16 --scenario normal; echo

CLI shape

dbctl hash workload --seed N --ops N --keys N --scenario <normal|fault>
  • Prints the SHA-256 hex of the cluster snapshot to stdout.
  • Writes no trailing newline (matches db-22 convention so shell comparisons stay simple).
  • Exits 2 on bad arguments.

Frozen scenarios

| Scenario | Command |
|---|---|
| normal | dbctl hash workload --seed 42 --ops 200 --keys 16 --scenario normal |
| fault | dbctl hash workload --seed 7 --ops 2000 --keys 128 --scenario fault |

Expected outputs:

normal: 5976b45b9f40f440e8249da27fe4fe752e005f606efc3596bdb25ca4e4f99296
fault : d67c36725af65242e985a308db5152af2a3e2525fab33d11ed6e826a252ff792

Observation — db-23 capstone

What we measured during development

1. Log + commit_index advance lock-step on happy path

Three submits with no fault:

after submit Put(1,100): log=[1] commit=1 kv={1:100}  (all 3 nodes)
after submit Put(2,200): log=[1,2] commit=2 kv={1:100,2:200}
after submit Del(1):     log=[1,2,3] commit=3 kv={2:200}

Each submit returns synchronously with all three nodes already in the post-state. This is the put_then_del_replicates test.

2. Quorum still progresses with one follower down

Take follower 1 down between submits. Leader + follower 2 still form a quorum:

follower 1 down.
submit Put(2,2): leader.commit=2 follower2.commit=2 follower1.commit=1
submit Put(3,3): leader.commit=3 follower2.commit=3 follower1.commit=1

This is the fault_window_then_catchup_converges test. After catch_up(1):

follower1.log.len = 3, follower1.commit = 3, follower1.kv = {2:2, 3:3}

3. The snapshot is byte-identical across languages

For both frozen scenarios:

[normal] rust=5976b45b9f40f440e8249da27fe4fe752e005f606efc3596bdb25ca4e4f99296
[normal] go  =5976b45b9f40f440e8249da27fe4fe752e005f606efc3596bdb25ca4e4f99296
[normal] cpp =5976b45b9f40f440e8249da27fe4fe752e005f606efc3596bdb25ca4e4f99296
[normal] gold=5976b45b9f40f440e8249da27fe4fe752e005f606efc3596bdb25ca4e4f99296
[fault]  rust=d67c36725af65242e985a308db5152af2a3e2525fab33d11ed6e826a252ff792
[fault]  go  =d67c36725af65242e985a308db5152af2a3e2525fab33d11ed6e826a252ff792
[fault]  cpp =d67c36725af65242e985a308db5152af2a3e2525fab33d11ed6e826a252ff792
[fault]  gold=d67c36725af65242e985a308db5152af2a3e2525fab33d11ed6e826a252ff792

This is what scripts/cross_test.sh prints on success.

4. Snapshot size for a 1-write cluster

8 magic + 1 id + 8 term + 8 commit + 4 log_len
+ 1 entry of (8+8+1+8+8) = 33
+ 4 kv_len + 1 kv of (8+8) = 20
= 82 bytes per node × 3 nodes = 246 bytes total

Verified in snapshot_layout_smoke tests in all three languages.

What we did not observe

  • Any divergence between languages, ever, on either scenario.
  • Any nondeterminism within a single language (each scenario run twice in the determinism tests).
  • Any case where a follower's log moved ahead of the leader — by construction, only the leader appends new entries; followers only ever copy.

Caveat

The cluster is in-process. We cannot observe real network behavior — no message loss, reordering, or partial partitions. The lab models replication semantics under controlled failures, not network robustness. The latter is left to the broader ideas / future work.

Verification — db-23 capstone

Acceptance criteria

| # | Property | Where checked |
|---|---|---|
| 1 | SHA-256 implementation matches NIST vectors. | sha256_vectors test, all 3 langs. |
| 2 | SplitMix64 matches the known value splitmix64(0). | splitmix64_known test, all 3 langs. |
| 3 | Happy-path Put/Del fully replicates and applies on every node. | put_then_del_replicates test, all 3 langs. |
| 4 | After a fault window + catch_up, all three nodes converge. | fault_window_then_catchup_converges test, all 3 langs. |
| 5 | Per-node snapshot layout is exactly 82 bytes for a 1-op cluster. | snapshot_layout_smoke test, all 3 langs. |
| 6 | The normal scenario is deterministic (two runs hash-equal). | workload_is_deterministic test, all 3 langs. |
| 7 | The fault scenario is deterministic. | fault_scenario_is_deterministic test, all 3 langs. |
| 8 | Normal scenario hashes to the frozen golden. | scenario_normal_frozen test, all 3 langs + cross_test.sh. |
| 9 | Fault scenario hashes to the frozen golden. | scenario_fault_frozen test, all 3 langs + cross_test.sh. |

Each language runs its own copy of the 9 tests, so the suite total is 27 assertions of cross-cutting properties plus 6 hash-equality checks across languages (3 langs × 2 scenarios) in cross_test.sh.

How to run

bash scripts/verify.sh      # ends with === OK ===
bash scripts/cross_test.sh  # ends with === ALL OK ===

Failure-mode triage

| Symptom | Likely cause |
|---|---|
| Rust passes, Go/C++ fails frozen test | Map iteration order: confirm Go sorts keys, confirm std::map (not std::unordered_map). |
| All three languages disagree on the same scenario | RNG-stream drift: check that step_op draws exactly 3 words regardless of kind. |
| Determinism test fails within one language | Some hidden non-determinism (HashMap, address ordering). Switch to ordered map. |
| Snapshot length wrong | Off-by-one in log_len/kv_len u32, or wrong endian. |
| Fault test fails only in C++ | Probably unsigned char vs char in MAGIC comparison, or signed arithmetic on i64. |

Frozen hashes (locked)

HASH_NORMAL = 5976b45b9f40f440e8249da27fe4fe752e005f606efc3596bdb25ca4e4f99296
HASH_FAULT  = d67c36725af65242e985a308db5152af2a3e2525fab33d11ed6e826a252ff792

These constants live in src/rust/src/lib.rs, src/go/db23_test.go, and src/cpp/src/db23.h, and are also hard-coded in scripts/cross_test.sh. Changing any byte of the wire format requires regenerating all four copies in lock-step.

Broader ideas — what to build next

This capstone is a deliberately minimal replicated KV. Here are the natural follow-on projects, in roughly increasing scope:

1. Leader election

Replace "node 0 is leader forever" with a Raft-style election: randomized timeouts, terms, RequestVote, log-completeness check. Determinism becomes hard the moment timers exist, so frozen-hash testing must be replaced with invariant-style testing (e.g. "every successful read returns a value from the leader's committed log").
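As a rough illustration of what invariant-style testing could look like once elections exist, the sketch below checks one safety property: any two nodes agree on every entry both consider committed. The function name and the stripped-down LogEntry are hypothetical and not part of the lab code.

```rust
// Hypothetical, reduced entry type: only the fields needed for comparison.
#[derive(Clone, Copy, Debug, PartialEq)]
struct LogEntry {
    term: u64,
    index: u64,
}

/// State-machine safety check between two nodes: the prefix that BOTH nodes
/// have committed must be identical. Also fails if either node claims a
/// commit index beyond the end of its own log.
fn committed_prefixes_agree(
    commit_a: u64,
    log_a: &[LogEntry],
    commit_b: u64,
    log_b: &[LogEntry],
) -> bool {
    let n = commit_a.min(commit_b) as usize;
    n <= log_a.len() && n <= log_b.len() && log_a[..n] == log_b[..n]
}
```

A randomized test would call this for every pair of nodes after every step, instead of comparing a final hash against a golden value.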

2. Real network

Move from synchronous function calls to in-memory channels first, then to TCP RPC, then to UDP with retransmission. At each layer add the corresponding failure injection (drop, reorder, duplicate, delay) and re-verify safety invariants.

3. Log compaction & snapshots

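Today catch_up copies every entry the follower is missing straight from the leader's log, so the leader must retain its entire log forever. For a long-running cluster this is infeasible. Add Raft-style snapshots: the leader sends its full kv state plus the index that state represents, the follower installs it, then resumes replication from lastIncludedIndex + 1.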

4. Membership changes

Add a Reconfigure op that mutates the cluster set. Use the joint-consensus or single-server membership change algorithms.

5. Read consistency levels

  • Stale read: any follower answers from its local kv.
  • Read-your-writes: client reads from leader.
  • Linearizable read: leader confirms it is still leader via a heartbeat to a quorum before answering, or uses Raft's ReadIndex / lease read.

6. Multi-shard / sharded KV

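Use a hash of the key to pick a shard; each shard is its own 3-node Raft group. Add a meta-shard that owns the shard map. This is broadly the architecture of TiKV and CockroachDB (both shard by key range over Raft groups) and of Spanner (which uses Paxos groups rather than Raft).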
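A minimal sketch of the key-to-shard step, assuming a fixed shard count and using the SplitMix64 finalizer as a stable mixing function. Both mix64 and shard_for are hypothetical names; the lab does not prescribe a hash.

```rust
/// SplitMix64 finalizer, used here as a seed-free mixing function so that
/// every language implementation maps a key to the same shard.
fn mix64(mut z: u64) -> u64 {
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Picks the shard (Raft group) responsible for a key.
fn shard_for(key: i64, num_shards: u64) -> u64 {
    assert!(num_shards > 0, "need at least one shard");
    mix64(key as u64) % num_shards
}
```

Plain modulo remaps most keys whenever num_shards changes, which is one reason the shard map is worth keeping in a meta-shard rather than computing it implicitly.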

7. Transactions across shards

Layer 2PC (with a transaction coordinator log) over the shard groups. Or do Percolator-style snapshot isolation. Or go full Spanner with TrueTime.

8. Jepsen-style testing

Property-based testing with random clients, random faults (partitions, clock skew, node kills), and a linearizability checker (Knossos or Porcupine).

9. Replace the in-process state machine

Plug in the LSM from db-09 or the B-tree from db-15 as the underlying KV store. The replication layer (this lab) shouldn't have to change.

10. Geo-replication

A second tier of replication across regions, with the per-region cluster acting as a single logical replica. Conflict resolution becomes the central question.

Step 01 — Cluster and log

Goal

Define the three core types — Op, LogEntry, Node — and the container Cluster that holds three nodes. No replication yet; the leader appends to its own log only.

Tasks

  1. Define OpKind as Put | Del and Op { kind, k: i64, v: i64 }.
  2. Define LogEntry { term: u64, index: u64, op: Op }.
  3. Define Node { id: u8, term: u64, commit_index: u64, last_applied: u64, log: Vec<LogEntry>, kv: Map<i64,i64> }.
  4. Implement Node::append requiring entry.index == log.len() + 1, with idempotent re-acceptance of an already-present index (used later by catch_up).
  5. Implement Node::apply_committed: while last_applied < commit_index, apply log[last_applied] to kv and increment.
  6. Implement Node::encode with the canonical wire format from CONCEPTS.md.
  7. Implement Cluster::new with three nodes (ids 0, 1, 2) all marked up, and Cluster::encode_snapshot = concat of all three encodings.
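Tasks 1 to 5 can be sketched in Rust roughly as follows. This is one possible shape, not the reference solution: the bool return from append and the choice of BTreeMap for Map are assumptions, and the idempotent branch assumes the re-offered entry equals the stored one.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug, PartialEq)]
enum OpKind { Put, Del }

#[derive(Clone, Copy, Debug, PartialEq)]
struct Op { kind: OpKind, k: i64, v: i64 }

#[derive(Clone, Copy, Debug, PartialEq)]
struct LogEntry { term: u64, index: u64, op: Op }

#[derive(Debug, Default)]
struct Node {
    id: u8,
    term: u64,
    commit_index: u64,
    last_applied: u64,
    log: Vec<LogEntry>,
    kv: BTreeMap<i64, i64>, // ordered map, so iteration order is deterministic
}

impl Node {
    fn new(id: u8) -> Self {
        Node { id, ..Default::default() }
    }

    /// Accepts the next entry (index == log.len() + 1). Re-offering an index
    /// that is already present is treated as success without changing the log.
    fn append(&mut self, e: LogEntry) -> bool {
        let len = self.log.len() as u64;
        if e.index >= 1 && e.index <= len {
            return true; // already have it (assumed identical)
        }
        if e.index != len + 1 {
            return false; // gap in the log: reject
        }
        self.log.push(e);
        true
    }

    /// Applies entries (last_applied, commit_index] to kv, in order.
    fn apply_committed(&mut self) {
        while self.last_applied < self.commit_index {
            // Indices are 1-based, so the next entry sits in slot last_applied.
            let e = self.log[self.last_applied as usize];
            match e.op.kind {
                OpKind::Put => { self.kv.insert(e.op.k, e.op.v); }
                OpKind::Del => { self.kv.remove(&e.op.k); }
            }
            self.last_applied += 1;
        }
    }
}
```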

Acceptance

  • snapshot_layout_smoke test passes in all three languages.
  • An empty cluster's snapshot has length 3 × (8+1+8+8+4+4) = 99 bytes.

Pitfalls

  • Go map iteration order is undefined — sort keys before encoding.
  • std::map (ordered) in C++, NOT std::unordered_map.
  • All multi-byte ints are little-endian.
  • v for a Del op encodes as 0.
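The encoder below is a sketch under a guessed layout, chosen only because it reproduces the two sizes quoted in this lab (33 bytes per empty node, 82 bytes for a node holding one Put). The authoritative field order is the one in CONCEPTS.md. The MAGIC value, the kind tags 0/1, and the function name encode_node are placeholders.

```rust
use std::collections::BTreeMap;

// Placeholder magic: the real 8-byte value is defined in CONCEPTS.md.
const MAGIC: &[u8; 8] = b"DB23NODE";

#[derive(Clone, Copy)]
enum OpKind { Put, Del }
#[derive(Clone, Copy)]
struct Op { kind: OpKind, k: i64, v: i64 }
#[derive(Clone, Copy)]
struct LogEntry { term: u64, index: u64, op: Op }

struct Node {
    id: u8,
    term: u64,
    commit_index: u64,
    log: Vec<LogEntry>,
    kv: BTreeMap<i64, i64>,
}

/// Assumed layout, all integers little-endian:
/// magic(8) id(1) term(8) commit_index(8)
/// log_len(u32) then per entry: term(8) index(8) kind(1) k(8) v(8)
/// kv_len(u32)  then per pair:  k(8) v(8)
fn encode_node(n: &Node) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(MAGIC);
    out.push(n.id);
    out.extend_from_slice(&n.term.to_le_bytes());
    out.extend_from_slice(&n.commit_index.to_le_bytes());
    out.extend_from_slice(&(n.log.len() as u32).to_le_bytes());
    for e in &n.log {
        out.extend_from_slice(&e.term.to_le_bytes());
        out.extend_from_slice(&e.index.to_le_bytes());
        let (tag, v) = match e.op.kind {
            OpKind::Put => (0u8, e.op.v),
            OpKind::Del => (1u8, 0), // v for a Del op encodes as 0
        };
        out.push(tag);
        out.extend_from_slice(&e.op.k.to_le_bytes());
        out.extend_from_slice(&v.to_le_bytes());
    }
    out.extend_from_slice(&(n.kv.len() as u32).to_le_bytes());
    for (k, v) in &n.kv {
        // BTreeMap iterates in ascending key order, so no explicit sort is needed.
        out.extend_from_slice(&k.to_le_bytes());
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}
```

Under this layout an empty node is 8+1+8+8+4+4 = 33 bytes, one log entry adds 33 bytes, and one kv pair adds 16, giving 33 + 33 + 16 = 82 bytes for a node with a single applied Put.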

Step 02 — Replication and commit

Goal

Wire Cluster::submit so that one Op propagates from the leader to every live follower, advances commit_index on majority, and applies into the local kv state.

Tasks

  1. In submit(op):
    • Compute leader_idx = leader.log.len() + 1.
    • Build LogEntry { term: leader.term, index: leader_idx, op }.
    • Append on the leader (must succeed).
    • For each follower id 1, 2: if up[fid], append on that follower.
  2. Count acks: start at 1 (leader), then +1 for each up follower whose log.len() >= leader_idx.
  3. If acks >= 2 (majority of 3):
    • Set leader.commit_index = leader_idx; call leader.apply_committed.
    • For each follower whose log.len() >= leader_idx, set its commit_index to leader_idx and call apply_committed.
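The submit steps above might look like this in Rust. It is a sketch, not the reference solution: the Cluster layout (a fixed array of nodes plus an up array, node 0 as leader) and the bool return value are assumptions made to keep the example self-contained.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug, PartialEq)]
enum OpKind { Put, Del }
#[derive(Clone, Copy, Debug, PartialEq)]
struct Op { kind: OpKind, k: i64, v: i64 }
#[derive(Clone, Copy, Debug, PartialEq)]
struct LogEntry { term: u64, index: u64, op: Op }

#[derive(Default)]
struct Node {
    term: u64,
    commit_index: u64,
    last_applied: u64,
    log: Vec<LogEntry>,
    kv: BTreeMap<i64, i64>,
}

impl Node {
    fn append(&mut self, e: LogEntry) -> bool {
        let len = self.log.len() as u64;
        if e.index >= 1 && e.index <= len { return true; }
        if e.index != len + 1 { return false; }
        self.log.push(e);
        true
    }
    fn apply_committed(&mut self) {
        while self.last_applied < self.commit_index {
            let e = self.log[self.last_applied as usize];
            match e.op.kind {
                OpKind::Put => { self.kv.insert(e.op.k, e.op.v); }
                OpKind::Del => { self.kv.remove(&e.op.k); }
            }
            self.last_applied += 1;
        }
    }
}

struct Cluster { nodes: [Node; 3], up: [bool; 3] }

impl Cluster {
    fn new() -> Self {
        Cluster { nodes: [Node::default(), Node::default(), Node::default()], up: [true; 3] }
    }

    /// Returns true if the entry reached a majority and was committed.
    fn submit(&mut self, op: Op) -> bool {
        let leader_idx = self.nodes[0].log.len() as u64 + 1;
        let entry = LogEntry { term: self.nodes[0].term, index: leader_idx, op };
        assert!(self.nodes[0].append(entry), "leader append must succeed");

        let mut acks = 1; // the leader's own append
        let mut has_entry = [true, false, false];
        for fid in 1..3 {
            // Only an up follower that actually holds the entry may ack.
            if self.up[fid]
                && self.nodes[fid].append(entry)
                && (self.nodes[fid].log.len() as u64) >= leader_idx
            {
                has_entry[fid] = true;
                acks += 1;
            }
        }
        if acks < 2 {
            return false; // no majority: nothing is committed
        }
        for id in 0..3 {
            if has_entry[id] {
                self.nodes[id].commit_index = leader_idx; // bump first...
                self.nodes[id].apply_committed();         // ...then apply
            }
        }
        true
    }
}
```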

Acceptance

  • put_then_del_replicates test passes in all three languages.
  • After three submits in a row to a fresh cluster, every node has log.len() == 3 and commit_index == 3.

Pitfalls

  • Don't advance commit_index on a follower that hasn't received the entry — that's how silent divergence happens.
  • The leader commits as soon as a majority (2 of 3) holds the entry. One follower may be missing, because the leader's own append counts as the first ack.
  • apply_committed must be called after commit_index is bumped, not before.

Step 03 — Fault injection and catch-up

Goal

Add the failure-injection schedule, the catch_up operation, and the top-level run_cluster workload driver — completing the lab.

Tasks

  1. Implement Cluster::set_follower_up(fid, up) (assert fid is 1 or 2, never 0).
  2. Implement Cluster::catch_up(fid):
    • Snapshot the leader's log and commit_index.
    • While the follower's log.len() is less than the leader's, append leader_log[fol.log.len()] to the follower.
    • If the follower's commit_index is below the leader's, set it to the leader's and apply_committed.
  3. Implement step_op(rng, keys):
    • Draw r1, r2, r3 = rng.next() (always three).
    • kind = (r1 >> 62) & 0x3; 0,1,2 → Put, 3 → Del.
    • k = i64(r2 % keys), v = i64(r3 % 1000).
  4. Implement run_cluster(seed, ops, keys, scenario):
    • down_start = ops/2, down_end = (ops*3)/4, with_fault = (scenario == "fault").
    • For i in 0..ops:
      • If with_fault && i == down_start: set follower 1 down.
      • If with_fault && i == down_end: set follower 1 up, then catch_up(1).
      • submit(step_op(rng, keys)).
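    • After the loop: if with_fault && !up[1], set follower 1 up and catch_up(1). (Defensive guard: with the schedule above, down_end < ops whenever ops ≥ 1, so the in-loop recovery always runs; this branch only fires if the schedule is changed.)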
  5. Write a dbctl hash workload --seed N --ops N --keys N --scenario <normal|fault> CLI that prints the SHA-256 hex of run_cluster(...).encode_snapshot() with no trailing newline.
  6. Freeze the two scenario hashes as named constants and assert them in two tests per language. Cross-check with scripts/cross_test.sh.
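A sketch of step_op and the fault-window arithmetic from tasks 3 and 4. The SplitMix64 update is the standard published one, but seeding it directly with state = seed is an assumption, as is the helper name fault_window. Op.v is kept as drawn here; zeroing it for Del is left to the encoder.

```rust
struct SplitMix64 { state: u64 }

impl SplitMix64 {
    fn next(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum OpKind { Put, Del }

#[derive(Clone, Copy, Debug, PartialEq)]
struct Op { kind: OpKind, k: i64, v: i64 }

/// Draws exactly three words per op, whatever the kind, so the RNG stream
/// stays aligned across Put and Del.
fn step_op(rng: &mut SplitMix64, keys: u64) -> Op {
    // Tuple fields are evaluated left to right, so r1 is the first draw.
    let (r1, r2, r3) = (rng.next(), rng.next(), rng.next());
    let kind = if (r1 >> 62) & 0x3 == 3 { OpKind::Del } else { OpKind::Put };
    Op { kind, k: (r2 % keys) as i64, v: (r3 % 1000) as i64 }
}

/// (down_start, down_end) for the fault scenario.
fn fault_window(ops: u64) -> (u64, u64) {
    (ops / 2, ops * 3 / 4)
}
```

For the two frozen configurations this gives a down window of 100..150 for ops=200 and 1000..1500 for ops=2000.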

Acceptance

  • verify.sh ends with === OK ===.
  • cross_test.sh ends with === ALL OK ===.
  • The two frozen hashes match across Rust, Go, and C++:
    • 5976b45b9f40f440e8249da27fe4fe752e005f606efc3596bdb25ca4e4f99296 (normal, seed=42 ops=200 keys=16)
    • d67c36725af65242e985a308db5152af2a3e2525fab33d11ed6e826a252ff792 (fault, seed=7 ops=2000 keys=128)

Pitfalls

  • Drawing fewer RNG words on the Del branch will silently desync hashes — always draw three.
  • The post-loop catch-up matters: if the run ends inside the down window, follower 1 still needs to converge.
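  • In Rust, catch_up should clone the leader's log (or split the node slice with split_at_mut) before touching the follower: holding a reference into the leader while mutably borrowing the follower from the same collection does not pass the borrow checker.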
  • The "ack on up[fid] only" rule is essential: a down follower contributes zero acks regardless of its log length.
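One way to satisfy the borrow checker in catch_up is sketched below. To stay short it flattens Op into the log entry and treats every entry as a Put, and it takes the node array directly instead of a Cluster; both simplifications are assumptions of this example, not the lab's design.

```rust
use std::collections::BTreeMap;

// Simplified entry: Op is flattened and every entry is treated as a Put.
#[derive(Clone, Copy, Debug, PartialEq)]
struct LogEntry { term: u64, index: u64, k: i64, v: i64 }

#[derive(Default)]
struct Node {
    commit_index: u64,
    last_applied: u64,
    log: Vec<LogEntry>,
    kv: BTreeMap<i64, i64>,
}

impl Node {
    fn apply_committed(&mut self) {
        while self.last_applied < self.commit_index {
            let e = self.log[self.last_applied as usize];
            self.kv.insert(e.k, e.v);
            self.last_applied += 1;
        }
    }
}

/// Brings follower `fid` up to date with the leader (node 0).
fn catch_up(nodes: &mut [Node; 3], fid: usize) {
    assert!(fid == 1 || fid == 2, "node 0 is the leader");
    // Copy what we need from the leader first. After these two lines no
    // borrow of nodes[0] is alive, so nodes[fid] can be borrowed mutably.
    let leader_log = nodes[0].log.clone();
    let leader_commit = nodes[0].commit_index;

    let fol = &mut nodes[fid];
    while fol.log.len() < leader_log.len() {
        let next = leader_log[fol.log.len()]; // first entry the follower lacks
        fol.log.push(next);
    }
    if fol.commit_index < leader_commit {
        fol.commit_index = leader_commit;
        fol.apply_committed();
    }
}
```

Cloning costs one copy of the leader's log per catch-up. A split_at_mut(1) on the node slice would avoid the copy at the price of slightly more index bookkeeping.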