Distributed Systems Engineer — Build Databases & Consensus From Scratch

"What I cannot create, I do not understand." — Richard Feynman

A lab-based curriculum for becoming a senior distributed systems engineer by building the systems you'll one day operate, debug, and replace: LevelDB (LSM-tree storage), SQLite (B-tree storage + SQL), and the three canonical consensus algorithms — Raft, Paxos, and ZAB — all implemented from scratch in Rust, Go, and C++.

Why This Repo Exists

Most engineers treat databases and consensus as black boxes. This curriculum makes them transparent. You will:

Write storage engines that flush, compact, recover, and serve concurrent reads.
Implement consensus protocols that survive node crashes, network partitions, and message reordering.
Reason about hardware trade-offs: SSD vs HDD seek latency, write amplification, fsync cost, io_uring vs blocking I/O, cache-line locality, NUMA effects.
Compare algorithm families: LSM vs B-tree, level-based vs size-tiered compaction, Raft vs Multi-Paxos vs ZAB.
Build the same thing three times — once in each language — to internalize the design (not the syntax).

Curriculum at a Glance

Phase	Theme	Labs
1	Storage Primitives & Foundations	`db-01` … `db-04`
2	LevelDB / LSM-Tree	`db-05` … `db-09`
3	SQLite / B-Tree	`db-10` … `db-15`
4	Consensus Algorithms	`db-16` … `db-20`
5	Advanced Storage & Capstone	`db-21` … `db-23`
6	Cloud Gateway & Application Networking	`gw-00` … `gw-12`
7	Platform & Distributed Systems Architecture	`pa-00` … `pa-10`

See PHASES.md for the full breakdown with learning objectives per lab.

Phase 6 — Cloud Gateway & Application Networking

Phases 1–5 build the systems behind a request (storage + consensus). Phase 6 builds the systems in front of a request: the L4/L7 data plane, the API gateway, the WebSocket fleet, the xDS control plane, and the Kubernetes substrate they run on. It is written toward the Netflix Distributed Systems Engineer – Cloud Gateway role: every lab maps to a job-description requirement and a Netflix engineering talk (Zuul, connection churn, Pushy, Studios security, Titus→Kubernetes, NRI/OCI hooks). Start at the Hitchhiker's Guide, the Phase 6 overview, and the interview playbook. Every lab ships real, stdlib-only Go that passes go test -race (an L4 proxy, an HTTP/2 parser, a Zuul-shaped gateway, the connection-churn fix, a Pushy-style WebSocket fleet, the resilience toolkit, an mTLS gateway, an xDS control plane, a Kubernetes operator, the observability primitives, and a migration rollout engine) plus a maintainer-level GUIDE.md. Verify the whole phase: bash gw-00-cloud-gateway-overview/verify-all.sh.

Phase 7 — Platform & Distributed Systems Architecture

Phase 7 zooms out from any single system to the architecture of a platform: how you decompose it into services, connect them (sync RPC + async events), partition data and keep it consistent, ship it (IaC, GitOps), keep it reliable (SLOs), and govern the decisions (ADRs, fitness functions, consensus). It's written toward the Apple Software Architect – Distributed Systems & Platform Engineering role; topics already covered earlier (Kubernetes operators/CRDs, service mesh, tracing, circuit breakers) are cross-referenced rather than rebuilt. Every lab ships real, stdlib-only Go that passes go test -race (an event bus, a partitioned commit log, the transactional outbox + a saga orchestrator, consistent hashing + quorums, an IaC engine, a GitOps reconciler, an SLO/burn-rate engine, and architecture fitness functions) plus a maintainer-level GUIDE.md. Start at the Hitchhiker's Guide, the overview, and the architect interview playbook. Verify the whole phase: bash pa-00-platform-architecture-overview/verify-all.sh.

How To Use This Repo

Read TOOLS.md and install the required toolchains (Rust, Go, C++/CMake).

Start with db-01-storage-primitives/. Each lab is self-contained and has the same shape:

db-NN-<name>/
├── CONCEPTS.md       # The "why" — read this first
├── references.md     # Papers and source-code links to study
├── docs/
│   ├── analysis.md       # Design trade-offs (hardware, algorithmic)
│   ├── broader-ideas.md  # Extensions, alternatives, future work
│   ├── execution.md      # Toolchain versions, quick-start commands
│   ├── observation.md    # Debugging, profiling, monitoring
│   └── verification.md   # Pass/fail checks for your implementation
├── steps/            # Numbered, sequential implementation guides
│   ├── 01-*.md
│   └── 02-*.md
└── src/
    ├── rust/         # Cargo workspace
    ├── go/           # Go module
    └── cpp/          # CMake project

Work through steps/ in order. The reference code in src/ is a target — try to write your own first, then compare.
Run the checks in docs/verification.md before moving on.

What You Will Build

By the end of the curriculum you will have implemented (×3 languages):

A crash-safe write-ahead log with CRC32 checksums and group commit.
A skip-list MemTable, an SSTable file format with block compression, and level-based compaction.
A page-oriented B+-tree with a pager, rollback journal, and WAL mode.
A hand-written SQL tokenizer, parser, AST, and bytecode virtual machine.
A transaction manager with MVCC snapshot reads and serializable writes.
A complete Raft implementation with snapshotting and membership changes.
Single-decree Paxos and Multi-Paxos with a stable leader.
A simplified ZAB broadcast layer with epoch transitions.
A 3-node distributed KV store combining Raft with your LevelDB clone.
A capstone mini distributed SQL database (the storage engine, the SQL frontend, and Raft replication — all your own code).

Prerequisites

Comfortable with C-family syntax in at least one systems language (you'll pick up the other two as you go).
Familiarity with binary trees, hash tables, and Big-O analysis.
Basic Linux command-line and git.
Not required: prior distributed systems knowledge, SQL internals knowledge, or database engine experience. We build it all from the ground up.

Pedagogical Style

Modeled after cstack/db_tutorial (concept-first, incremental, runnable code at every step) and the ai-engineering/ lab repo (consistent 8-part CONCEPTS.md, docs/, steps/, src/ structure).

Every CONCEPTS.md follows the same 8-part framework:

What Is It — one-paragraph executive summary
Why It Matters — concrete benefits
How It Works — ASCII architecture diagram
Core Terminology — table of precise definitions
Mental Models — analogies for intuition
Common Misconceptions — myths corrected
Interview Talking Points — what to say in a senior systems interview
Connections to Other Labs — how this fits the bigger picture

Status

Phase	Status
Phase 1 — Storage Primitives	Lab 01 complete, 02–04 scaffolded
Phase 2 — LevelDB	Scaffolded
Phase 3 — SQLite	Scaffolded
Phase 4 — Consensus	Scaffolded
Phase 5 — Advanced & Capstone	Scaffolded
Phase 6 — Cloud Gateway	Concepts + docs + real tested Go (gw-00…gw-12)
Phase 7 — Platform Architecture	Concepts + docs + real tested Go (pa-00…pa-10)

See PHASES.md for per-lab status.

License

MIT — see source headers in each implementation.

Phases & Labs

This curriculum has 7 phases and 47 labs. Phases 1–5 build a database and the consensus protocols underneath it; Phase 6 turns to the adjacent domain of cloud gateways and application networking (the Netflix Cloud Gateway track); Phase 7 zooms out to platform & distributed-systems architecture (the Apple Software Architect track): how you decompose, connect, partition, ship, and govern a platform. Phases build on each other, but within Phase 4 (consensus) you can do Raft → Paxos → ZAB in any order after the foundations in db-16, and Phases 6 and 7 can each be read independently of Phases 2, 3, and 5 once you have the distributed-systems mindset from Phase 4.

Legend: ✅ complete · 🟡 scaffolded · ⬜ planned

Phase 1 — Storage Primitives & Foundations

Before you can build a database, you need to understand the medium it lives on.

Lab	Title	Status	Key Concepts
db-01	Storage Primitives	✅	Pages, byte order, `mmap` vs `pread`, alignment, HDD/SSD/NVMe latency
db-02	Data Structures for Storage	🟡	Skip lists, hash tables, when in-memory vs on-disk structures differ
db-03	Write-Ahead Log	🟡	WAL framing, CRC32, `fsync` semantics, group commit
db-04	Bloom Filters & Hashing	🟡	FPR math, xxHash vs Murmur, cuckoo & xor filter alternatives
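As a taste of the WAL framing and CRC32 checks listed for db-03, here is a minimal Go sketch. The frame layout (checksum, then length, then payload) and the names encodeRecord and decodeRecord are assumptions made for this illustration; the lab's actual on-disk format may differ.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// Hypothetical frame layout (an assumption for this sketch):
// [4-byte CRC32 of payload][4-byte little-endian payload length][payload]
const headerSize = 8

var errCorrupt = errors.New("wal: corrupt or truncated record")

// encodeRecord frames one payload for appending to the log.
func encodeRecord(payload []byte) []byte {
	buf := make([]byte, headerSize+len(payload))
	binary.LittleEndian.PutUint32(buf[0:4], crc32.ChecksumIEEE(payload))
	binary.LittleEndian.PutUint32(buf[4:8], uint32(len(payload)))
	copy(buf[headerSize:], payload)
	return buf
}

// decodeRecord reads one frame from the front of buf and returns the payload
// plus the number of bytes consumed. A short buffer or a checksum mismatch
// is reported as errCorrupt so recovery can stop at the last good record.
func decodeRecord(buf []byte) ([]byte, int, error) {
	if len(buf) < headerSize {
		return nil, 0, errCorrupt
	}
	sum := binary.LittleEndian.Uint32(buf[0:4])
	n := int(binary.LittleEndian.Uint32(buf[4:8]))
	if n > len(buf)-headerSize {
		return nil, 0, errCorrupt
	}
	payload := buf[headerSize : headerSize+n]
	if crc32.ChecksumIEEE(payload) != sum {
		return nil, 0, errCorrupt
	}
	return payload, headerSize + n, nil
}

func main() {
	rec := encodeRecord([]byte("put k1=v1"))
	payload, used, err := decodeRecord(rec)
	fmt.Println(string(payload), used, err)

	rec[len(rec)-1] ^= 0xFF // simulate a torn or bit-flipped write
	_, _, err = decodeRecord(rec)
	fmt.Println(err)
}
```

Putting the checksum in the header lets recovery reject a partially written tail record without knowing anything about the payload.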

Phase 2 — LevelDB / LSM-Tree

Build a production-shape LSM-tree key-value store, the way Google built LevelDB and Meta forked it into RocksDB.

Lab	Title	Status	Key Concepts
db-05	LSM MemTable	🟡	Skip-list MemTable, immutable MemTable, flush trigger
db-06	SSTable Format	🟡	Data/index/filter blocks, restart points, footer
db-07	LSM Compaction	🟡	Level vs size-tiered vs universal, write amplification
db-08	Block Cache & Iterators	🟡	LRU, MergingIterator, snapshot via sequence numbers
db-09	LevelDB Complete	🟡	Open/close, WriteBatch, recovery, YCSB benchmark
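The core of compaction (db-07) is a merge of sorted runs in which the newer version of a key wins. The sketch below merges just two in-memory runs; the entry type, the mergeRuns name, and the dropTombstones flag are hypothetical, and a real compaction would stream from SSTable iterators rather than slices.

```go
package main

import "fmt"

// entry is one key/value in a sorted run; a tombstone marks a deletion.
type entry struct {
	key       string
	value     string
	tombstone bool
}

// mergeRuns merges two runs that are each sorted by key with unique keys.
// When both runs contain a key, the entry from newer wins. If dropTombstones
// is true (only safe when no older data remains below the output),
// deletions are removed from the result.
func mergeRuns(newer, older []entry, dropTombstones bool) []entry {
	out := make([]entry, 0, len(newer)+len(older))
	emit := func(e entry) {
		if e.tombstone && dropTombstones {
			return
		}
		out = append(out, e)
	}
	i, j := 0, 0
	for i < len(newer) && j < len(older) {
		switch {
		case newer[i].key < older[j].key:
			emit(newer[i])
			i++
		case newer[i].key > older[j].key:
			emit(older[j])
			j++
		default: // same key: keep the newer version, skip the stale one
			emit(newer[i])
			i++
			j++
		}
	}
	for ; i < len(newer); i++ {
		emit(newer[i])
	}
	for ; j < len(older); j++ {
		emit(older[j])
	}
	return out
}

func main() {
	newer := []entry{{key: "a", value: "2"}, {key: "c", tombstone: true}}
	older := []entry{{key: "a", value: "1"}, {key: "b", value: "1"}, {key: "c", value: "1"}}
	for _, e := range mergeRuns(newer, older, true) {
		fmt.Println(e.key, e.value)
	}
}
```

Every byte in the output is rewritten even if it did not change, which is where the algorithmic write amplification of an LSM-tree comes from.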

Phase 3 — SQLite / B-Tree

Build a B+-tree storage engine, a pager, a SQL parser, a bytecode VM, and a transaction manager.

Lab	Title	Status	Key Concepts
db-10	B-Tree Fundamentals	🟡	B-Tree vs B+-Tree, page layout, splits & merges
db-11	Pager System	🟡	Page cache, rollback journal vs WAL mode, checkpointing
db-12	SQL Frontend	🟡	Tokenizer, parser, AST, VDBE bytecode VM
db-13	Transactions & MVCC	🟡	ACID, isolation levels, SQLite locks, MVCC vs 2PL
db-14	Indexes & Query Planning	🟡	Secondary indexes, cost-based planner, ART, BRIN
db-15	SQLite Complete	🟡	JOINs, aggregation, TPC-H subset benchmark

Phase 4 — Consensus Algorithms

The three canonical consensus families — implemented, tested, and compared.

Lab	Title	Status	Key Concepts
db-16	Distributed Fundamentals	🟡	CAP, FLP, linearizability, vector clocks, HLC
db-17	Raft	🟡	Election, AppendEntries, snapshotting, ReadIndex
db-18	Paxos	🟡	Single-decree, Multi-Paxos, Flexible Paxos
db-19	ZAB	🟡	Epochs, zxids, primary-backup vs leader-based
db-20	Distributed KV Store	🟡	Raft + LevelDB backend, linearizable reads, sharding
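Two small calculations recur throughout the consensus labs: the majority quorum size and the highest log index replicated on a quorum. A minimal sketch follows, assuming a hypothetical matchIndex slice with one entry per voter (leader included). Raft additionally requires that the entry at that index belong to the leader's current term before it is committed, which this sketch does not check.

```go
package main

import (
	"fmt"
	"sort"
)

// quorumSize returns the majority quorum for a cluster of n voters.
func quorumSize(n int) int { return n/2 + 1 }

// commitIndex returns the largest log index stored on a majority of nodes.
// matchIndex must be non-empty and hold, for every voter including the
// leader, the highest index known to be replicated there.
func commitIndex(matchIndex []int) int {
	sorted := append([]int(nil), matchIndex...)
	sort.Sort(sort.Reverse(sort.IntSlice(sorted)))
	// In descending order, the value at position quorum-1 is present on at
	// least quorumSize nodes.
	return sorted[quorumSize(len(sorted))-1]
}

func main() {
	fmt.Println(quorumSize(3), quorumSize(4), quorumSize(5))
	fmt.Println(commitIndex([]int{7, 5, 5, 3, 2}))
}
```

Note that a 4-node cluster needs 3 acknowledgements, so it tolerates only one failure, the same as a 3-node cluster.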

Phase 5 — Advanced Storage & Capstone

Lab	Title	Status	Key Concepts
db-21	Advanced Storage	🟡	`io_uring`, `O_DIRECT`, columnar layout, WiscKey
db-22	Performance & Benchmarking	🟡	YCSB A–F, flamegraphs, NUMA, perf counters
db-23	Capstone Distributed DB	🟡	SQL → planner → LevelDB → Raft; 2PC over Raft groups

Phase 6 — Cloud Gateway & Application Networking

The adjacent domain: the L4/L7 data plane, the API gateway, the WebSocket fleet, the control plane that programs them, and the Kubernetes substrate they run on. Built toward the Netflix Cloud Gateway role; each lab maps to a JD requirement and a Netflix talk. See gw-00 overview and the interview playbook.

Lab	Title	Status	Key Concepts
gw-00	Overview & Role Readiness	✅	JD↔lab map, talks decoded, interview/30-60-90 playbooks
gw-01	The L4 Data Plane	✅	TCP/UDP, epoll/event loop, backpressure, PROXY protocol, drain
gw-02	L7 Protocols	✅	HTTP/1.1, HTTP/2 frames/HPACK/flow control, gRPC, HTTP/3/QUIC
gw-03	API Gateway (Zuul model)	✅	Filter chain, event-loop async, dynamic routing, hot-reload
gw-04	Connection Management	✅	Pooling, per-event-loop pools, subsetting (Van der Corput), churn
gw-05	WebSockets & Pushy	✅	RFC 6455, push registry, async delivery, density, reconnect storms
gw-06	Resilience & Load Balancing	✅	P2C, outlier ejection, retry budgets, circuit breakers, adaptive concurrency
gw-07	Edge Security (mTLS)	✅	TLS 1.3, mTLS, SPIFFE/SVID, cert rotation, authz, zero-trust
gw-08	Envoy & xDS Control Plane	✅	Data/control plane split, xDS (LDS/RDS/CDS/EDS), ACK/NACK, ADS
gw-09	Kubernetes Networking	✅	CNI, kube-proxy, Services/EndpointSlices, conntrack, drain ordering
gw-10	Gateway API & Operators	✅	CRDs, reconcile loop, Gateway API, controller-runtime, finalizers
gw-11	Data-Plane Observability	✅	RED/USE, histograms, trace-context propagation, SLOs, error budgets
gw-12	Capstone: Gateway Migration	✅	Shadow→canary→ramp→soak, reversibility, stakeholder alignment, NRI/OCI
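Of the gw-06 concepts in the table, the circuit breaker is the easiest to show in isolation. This is a single-goroutine sketch with an injected millisecond clock; the breaker type, its threshold and cooldownMs fields, and the consecutive-failure trip rule are assumptions for illustration, not the lab's implementation.

```go
package main

import "fmt"

type state int

const (
	closed state = iota
	open
	halfOpen
)

func (s state) String() string {
	return [...]string{"CLOSED", "OPEN", "HALF-OPEN"}[s]
}

// breaker is a minimal circuit breaker. It is not safe for concurrent use.
type breaker struct {
	state      state
	failures   int   // consecutive failures while closed
	threshold  int   // consecutive failures that trip the breaker
	cooldownMs int64 // how long to stay open before probing
	openedAtMs int64
}

// allow reports whether a request may be sent at time nowMs.
func (b *breaker) allow(nowMs int64) bool {
	switch b.state {
	case open:
		if nowMs-b.openedAtMs < b.cooldownMs {
			return false // fail fast
		}
		b.state = halfOpen
		return true // cooldown elapsed: let a single probe through
	case halfOpen:
		return false // a probe is already in flight
	default:
		return true
	}
}

// record feeds back the outcome of a request that allow admitted.
func (b *breaker) record(success bool, nowMs int64) {
	if success {
		b.state, b.failures = closed, 0
		return
	}
	b.failures++
	if b.state == halfOpen || b.failures >= b.threshold {
		b.state, b.openedAtMs, b.failures = open, nowMs, 0
	}
}

func main() {
	b := &breaker{threshold: 2, cooldownMs: 1000}
	b.record(false, 0)
	b.record(false, 0)
	fmt.Println(b.state, b.allow(500))
	fmt.Println(b.allow(1500), b.state)
	b.record(true, 1500)
	fmt.Println(b.state)
}
```

Passing the clock in as an argument keeps the state machine deterministic, which makes it straightforward to test under go test -race without sleeping.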

Phase 6 note on shape: Phases 1–5 prove correctness with byte-identical cross-language dumps. Networking systems aren't byte-deterministic, so Phase 6 proves things the industry way: real, compilable mini-implementations that pass go test -race (stdlib-only Go, offline) plus demo programs that print each lab's headline result. Each lab ships CONCEPTS.md (the why), a maintainer-level GUIDE.md (the deep, hands-on companion), references.md, docs/ (analysis + execution + verification), code-rich steps/, scripts/verify.sh, and a working src/go/ you hack on. Run the whole phase with bash gw-00-cloud-gateway-overview/verify-all.sh. Start with the Hitchhiker's Guide.

Phase 7 — Platform & Distributed Systems Architecture

The architect's view: how you decompose a platform into services, connect them (sync + async), partition and keep data consistent, ship it (IaC, GitOps), keep it reliable (SLOs), and govern the decisions. Built toward the Apple Software Architect – Distributed Systems & Platform Engineering role; overlapping topics (operators/CRDs, service mesh, tracing, circuit breakers) are cross-referenced to Phase 6 / the consensus phase rather than rebuilt. See pa-00 overview, Hitchhiker's Guide, and the architect interview playbook.

Lab	Title	Status	Key Concepts
pa-00	Overview & Role Readiness	✅	JD↔lab map, architect interview playbook, 30-60-90, the architect's job
pa-01	Decomposition & Contracts	✅	Bounded contexts, distributed-monolith detection, blast radius, layering rules
pa-02	API Design (REST/gRPC/Events)	✅	Contract compatibility (CI gate), idempotency keys, opaque cursors, versioning
pa-03	Event-Driven Architecture	✅	Pub/sub, at-least-once + idempotent consumers, DLQ, choreography vs orchestration
pa-04	Partitioned Log (Kafka)	✅	Partitions, offsets, consumer groups, assignment, retention
pa-05	Delivery Semantics & Sagas	✅	Transactional outbox, dual-write problem, sagas + compensation, crash recovery
pa-06	Partitioning & Consistency	✅	Consistent hashing (vnodes), quorums (R+W>N), CAP/PACELC
pa-07	Infrastructure as Code	✅	Declarative DAG, plan/diff, topological apply, state, drift detection
pa-08	GitOps & Progressive Delivery	✅	Reconcile/sync/prune/self-heal, sync waves, SLO-gated promotion
pa-09	Reliability: SLOs & Bulkheads	✅	Error budgets, multi-window burn-rate alerting, bulkheads
pa-10	Architecture in Practice	✅	Fitness functions (architecture-as-tests), ADRs, design reviews, consensus
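The consistent-hashing ring with virtual nodes from pa-06 can be sketched as follows. The ring type, the "node#i" vnode naming, and the choice of FNV-1a are assumptions for this example; a hash collision between two vnodes would simply let the later one own that position.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring maps hash positions to physical nodes.
type ring struct {
	points []uint32          // sorted vnode positions
	owner  map[uint32]string // position -> physical node
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// newRing places vnodes positions per physical node on the ring.
func newRing(nodes []string, vnodes int) *ring {
	r := &ring{owner: make(map[uint32]string)}
	for _, n := range nodes {
		for v := 0; v < vnodes; v++ {
			p := hashKey(fmt.Sprintf("%s#%d", n, v))
			r.points = append(r.points, p)
			r.owner[p] = n
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// lookup returns the node owning the first position at or after the key's
// hash, wrapping around to the start of the ring.
func (r *ring) lookup(key string) string {
	h := hashKey(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.owner[r.points[i]]
}

func main() {
	old := newRing([]string{"a", "b", "c"}, 64)
	grown := newRing([]string{"a", "b", "c", "d"}, 64)
	moved := 0
	for i := 0; i < 1000; i++ {
		k := fmt.Sprintf("key-%d", i)
		if old.lookup(k) != grown.lookup(k) {
			moved++
		}
	}
	fmt.Printf("keys moved after adding one node: %d of 1000\n", moved)
}
```

The property worth checking is that a key only ever changes owner to the newly added node; with mod-N placement most keys would move between existing nodes as well.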

Phase 7 shape: same as Phase 6. Each lab ships real, stdlib-only Go that passes go test -race (an event bus, a partitioned log, the outbox/saga, a consistent-hashing ring, an IaC engine, a GitOps reconciler, an SLO engine, fitness functions) plus a maintainer-level GUIDE.md. Run it all with bash pa-00-platform-architecture-overview/verify-all.sh.

Suggested Pace

Full-time learner: ~2 labs per week ⇒ ~12 weeks for the 23 labs of Phases 1–5.
Side-project learner: ~1 lab every 1–2 weeks ⇒ ~6 months for Phases 1–5.
Reading-only path: skim CONCEPTS.md + docs/analysis.md per lab ⇒ ~1 week for the entire curriculum.

Recommended Progression

Phase 1 (must do all 4 in order)
   │
   ├─→ Phase 2 (LevelDB)  ──┐
   │                        │
   └─→ Phase 3 (SQLite) ────┤
                            ↓
                         Phase 4 (Consensus) ──→ Phase 6 (Cloud Gateway)
                            ↓                └─→ Phase 7 (Platform Architecture)
                         Phase 5 (Capstone)

Phase 2 and Phase 3 are independent — pick the storage style that excites you first. Phase 4 only references Phase 1 fundamentals, so you can detour into consensus early if you want. Phase 5's capstone assumes all four prior phases.

Phase 6 is a standalone track. It targets the Netflix Cloud Gateway role and only assumes the distributed-systems mindset from Phase 4 (the data-plane/control-plane split echoes consensus; config propagation is a consistency problem). If you're prepping for an application-networking / gateway role, you can jump straight to gw-00 after Phase 4. Within Phase 6, do gw-01 → gw-03 in order; after that the branches are independent and gw-12 (capstone) assumes the rest.

Phase 7 is also a standalone track, aimed at the Apple Software Architect role. It assumes only the Phase 4 distributed-systems mindset and cross-references Phase 6 for the overlapping infrastructure topics (operators/CRDs, service mesh, tracing, circuit breakers). Within Phase 7, do pa-01 → pa-05 in order (the comms + data spine); after that pa-06…pa-09 are independent and pa-10 (architecture-in-practice) ties it together. Start at pa-00.

Glossary

A unified glossary of terms used across all labs. Terms are grouped by domain.

Storage & I/O

Term	Definition
Page	The unit of I/O between disk and memory. Usually 4 KiB (matches OS page size) but databases often use 4–32 KiB.
Block	An SSTable's I/O unit (LevelDB default 4 KiB). Distinct from a B-tree "page" — both are I/O units but for different engines.
mmap	Map a file into process address space. Reads happen via page faults; writes via dirty pages flushed by the kernel.
pread/pwrite	Positional read/write syscalls. Explicit offset, no shared file pointer. Predictable cost, no page-fault stalls.
`O_DIRECT`	Open flag (Linux) that bypasses the page cache. Requires aligned buffers, aligned offsets, aligned sizes.
`fsync`	Force file data + metadata to stable storage. Blocks until disk acknowledges. Often the slowest syscall in a database.
`fdatasync`	Like `fsync` but skips non-essential metadata. Faster on most filesystems.
Write amplification (WA)	Bytes physically written / bytes logically written. SSDs have hardware WA; LSM-trees have algorithmic WA from compaction.
Read amplification (RA)	Bytes physically read / bytes logically read. LSM-trees suffer from RA due to checking multiple levels.
Space amplification	Bytes on disk / bytes of live data. LSMs have space amp from stale data awaiting compaction.
Endianness	Byte order. Little-endian (x86, ARM default): least-significant byte first. Big-endian: network byte order.
Alignment	Memory address being a multiple of N. Required for `O_DIRECT` (usually 512 B or 4 KiB) and SIMD ops.
`io_uring`	Linux async I/O API (≥ 5.1). Two ring buffers (SQ/CQ) shared between kernel and user space.
DMA	Direct Memory Access — disk controller writes directly to RAM without CPU involvement.
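The pread/pwrite and fsync entries above combine into the basic page I/O loop of a pager. Here is a small Go sketch, assuming a 4 KiB page and hypothetical helper names (readPage, writePage, openScratch). On Unix, (*os.File).ReadAt and WriteAt are positional and do not move a shared file offset.

```go
package main

import (
	"fmt"
	"os"
)

const pageSize = 4096

// readPage reads page number pageNo with a positional read. It returns an
// error (io.EOF) if the file does not contain a full page at that offset.
func readPage(f *os.File, pageNo int64) ([]byte, error) {
	buf := make([]byte, pageSize)
	if _, err := f.ReadAt(buf, pageNo*pageSize); err != nil {
		return nil, err
	}
	return buf, nil
}

// writePage writes one full page and forces it to stable storage.
func writePage(f *os.File, pageNo int64, page []byte) error {
	if len(page) != pageSize {
		return fmt.Errorf("page must be %d bytes, got %d", pageSize, len(page))
	}
	if _, err := f.WriteAt(page, pageNo*pageSize); err != nil {
		return err
	}
	return f.Sync() // fsync: durable, but usually the slow part
}

// openScratch creates a temporary file for the demo plus a cleanup func.
func openScratch() (*os.File, func(), error) {
	f, err := os.CreateTemp("", "pages-*.db")
	if err != nil {
		return nil, nil, err
	}
	return f, func() { f.Close(); os.Remove(f.Name()) }, nil
}

func main() {
	f, cleanup, err := openScratch()
	if err != nil {
		fmt.Println("open:", err)
		return
	}
	defer cleanup()

	page := make([]byte, pageSize)
	copy(page, "hello page 2")
	if err := writePage(f, 2, page); err != nil {
		fmt.Println("write:", err)
		return
	}
	got, err := readPage(f, 2)
	if err != nil {
		fmt.Println("read:", err)
		return
	}
	fmt.Printf("%q\n", got[:12])
}
```

Calling Sync after every page is the simplest durable policy and also the slowest; batching several writes behind one sync is the idea behind group commit.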

Hardware

Term	Definition
HDD seek time	~5–10 ms for random reads (head movement + rotational latency). ~150 MB/s sequential.
SATA SSD	~100 μs random read latency, ~500 MB/s sequential, ~80K IOPS.
NVMe SSD	~50–100 μs random read latency, ~3–7 GB/s sequential, ~500K–1M IOPS. Multiple hardware queues.
Cache line	CPU cache unit, almost always 64 bytes. Data-structure layout for cache locality matters.
NUMA	Non-Uniform Memory Access — CPU sockets have local RAM; cross-socket access is slower.
Wear leveling	SSD firmware spreads writes across blocks to even out flash wear. Causes hardware write amplification.

Data Structures

Term	Definition
Skip list	Probabilistic balanced structure with expected O(log n) ops and lock-free-friendly properties. Used in LevelDB MemTable.
B-Tree	Self-balancing m-ary tree. Internal nodes store keys + values + child pointers. Used for indexes.
B+-Tree	B-Tree variant where all values live in leaf nodes; internal nodes are pure routing. Used for tables in SQLite.
LSM-Tree	Log-Structured Merge-Tree. In-memory MemTable + on-disk sorted runs (SSTables), merged via compaction.
Bloom filter	Probabilistic set membership; no false negatives, tunable false positive rate. Used to skip SSTable lookups.
ART	Adaptive Radix Tree — modern in-memory index alternative to B-Trees, used by HyPer, DuckDB.
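The Bloom filter entry can be made concrete with a short sketch. The bloom type and its method names are hypothetical, FNV-1a is chosen only because it ships with the standard library, and the k probe positions come from double hashing (the Kirsch-Mitzenmacher construction) rather than k independent hash functions.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math"
)

type bloom struct {
	bits []uint64
	m    uint64 // number of bits
	k    uint64 // probes per key
}

func newBloom(m, k uint64) *bloom {
	return &bloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

// hashPair splits one 64-bit hash into a base and an odd step.
func hashPair(key []byte) (uint64, uint64) {
	h := fnv.New64a()
	h.Write(key)
	sum := h.Sum64()
	return sum & 0xffffffff, (sum >> 32) | 1
}

func (b *bloom) add(key []byte) {
	h1, h2 := hashPair(key)
	for i := uint64(0); i < b.k; i++ {
		p := (h1 + i*h2) % b.m
		b.bits[p/64] |= 1 << (p % 64)
	}
}

// mayContain returns false only if the key was definitely never added.
func (b *bloom) mayContain(key []byte) bool {
	h1, h2 := hashPair(key)
	for i := uint64(0); i < b.k; i++ {
		p := (h1 + i*h2) % b.m
		if b.bits[p/64]&(1<<(p%64)) == 0 {
			return false
		}
	}
	return true
}

// expectedFPR is the standard approximation (1 - e^(-kn/m))^k for n keys.
func expectedFPR(m, k, n uint64) float64 {
	return math.Pow(1-math.Exp(-float64(k)*float64(n)/float64(m)), float64(k))
}

func main() {
	b := newBloom(10000, 7) // 10 bits per key for 1000 keys
	for i := 0; i < 1000; i++ {
		b.add([]byte(fmt.Sprintf("key-%d", i)))
	}
	fmt.Println(b.mayContain([]byte("key-42")))
	fmt.Printf("expected FPR: %.4f\n", expectedFPR(10000, 7, 1000))
}
```

At 10 bits per key and 7 probes the formula gives a false positive rate a little under 1%, which is why that ratio is a common default for skipping SSTable lookups.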

Consensus

Term	Definition
Quorum	Subset of nodes whose agreement is required. Typically ⌊N/2⌋ + 1 for majority quorum.
Term / Epoch	Monotonically increasing identifier for a leadership period (Raft term, ZAB epoch, Paxos ballot).
Log index	Position of an entry in the replicated log. Indices are monotonic and dense.
Commit index	The largest log index known to be safely replicated to a quorum.
Linearizability	Strongest consistency: operations appear to take effect atomically at some point between their invocation and response.
Sequential consistency	All processes observe operations in a single global order that preserves each process's program order, but that order need not match real time.
Eventual consistency	If updates stop, all replicas eventually agree. No real-time guarantees.
CAP theorem	Under a network partition, you must choose Consistency or Availability. Partition tolerance is non-negotiable.
FLP impossibility	No deterministic asynchronous consensus protocol can guarantee progress with even one crash failure.
Lamport timestamp	Scalar logical clock: `L(a) < L(b)` if `a` happened-before `b`. Cannot detect concurrency.
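Vector clock	Per-node vector of counters. `VC(a) < VC(b)` iff every component of `VC(a)` is ≤ the matching component of `VC(b)` and at least one is strictly smaller. If neither clock is ≤ the other, the events are concurrent.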
HLC	Hybrid Logical Clock: combines physical time with a logical counter; bounded skew from real time.
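The difference between Lamport timestamps and vector clocks is easiest to see in the comparison function: a vector clock can report "concurrent", a scalar cannot. A minimal sketch, with hypothetical names (compare, merge, and the ordering constants) and the assumption that both clocks have one slot per node:

```go
package main

import "fmt"

type ordering int

const (
	before ordering = iota
	after
	equal
	concurrent
)

// compare assumes a and b have the same length (one slot per node).
func compare(a, b []uint64) ordering {
	aLess, bLess := false, false
	for i := range a {
		if a[i] < b[i] {
			aLess = true
		}
		if a[i] > b[i] {
			bLess = true
		}
	}
	switch {
	case aLess && bLess:
		return concurrent // each side has seen something the other has not
	case aLess:
		return before
	case bLess:
		return after
	default:
		return equal
	}
}

// merge is the receive rule: take the component-wise maximum of the local
// and incoming clocks, then tick the receiver's own slot.
func merge(local, incoming []uint64, self int) []uint64 {
	out := make([]uint64, len(local))
	for i := range local {
		out[i] = local[i]
		if incoming[i] > out[i] {
			out[i] = incoming[i]
		}
	}
	out[self]++
	return out
}

func main() {
	fmt.Println(compare([]uint64{1, 0, 0}, []uint64{1, 1, 0}) == before)
	fmt.Println(compare([]uint64{2, 0, 0}, []uint64{0, 1, 0}) == concurrent)
	fmt.Println(merge([]uint64{1, 0}, []uint64{0, 3}, 0))
}
```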

Transactions

Term	Definition
ACID	Atomicity, Consistency, Isolation, Durability — properties a transaction must satisfy.
Isolation level	READ UNCOMMITTED → READ COMMITTED → REPEATABLE READ → SERIALIZABLE. Each rules out more anomalies.
Dirty read	Reading data written by an uncommitted transaction.
Non-repeatable read	Reading the same row twice in one tx and getting different values.
Phantom read	A range query returns different rows when re-run within one tx.
MVCC	Multi-Version Concurrency Control — writes create new versions; readers see a snapshot.
2PL	Two-Phase Locking — acquire locks in a growing phase, release in a shrinking phase. Guarantees serializability.
2PC	Two-Phase Commit — distributed transaction protocol: prepare phase, then commit/abort. Blocking on coordinator failure.

SQL Engine

Term	Definition
VDBE	Virtual Database Engine — SQLite's bytecode VM that executes compiled SQL.
Prepared statement	A parsed and compiled SQL statement, reusable with different parameters.
Cardinality estimation	Predicting how many rows a query operator will produce. Core to the query planner.
Selectivity	Fraction of rows that satisfy a predicate. A low fraction (a highly selective predicate) ⇒ index scan preferred.
Covering index	An index that contains all columns needed by a query, so the table doesn't need to be touched.

Operational

Term	Definition
Snapshot	A consistent point-in-time view of data. Used for backups, MVCC reads, Raft log compaction.
Checkpoint	Operation that flushes in-memory state to disk so recovery has less log to replay.
Compaction	Background process that merges sorted files (LSM) or reclaims fragmented space (B-tree).
YCSB	Yahoo Cloud Serving Benchmark — standard KV workload suite (A–F). Used in `db-22`.
Jepsen	Test framework for distributed systems correctness; injects partitions/clock skew. Inspires our consensus tests.

Cloud Gateway & Application Networking (Phase 6)

Term	Definition
L4 / L7	Transport (TCP/UDP, opaque bytes) vs application (HTTP/gRPC/WebSocket, per-request) proxying.
Data plane	The proxies on the request path; optimized for p99 latency and throughput.
Control plane	The source of truth that computes + pushes config to the data plane; off the request path.
Event loop	A thread multiplexing many connections via `epoll`/`kqueue`; must never block (Zuul 2 / Netty).
C10K	The problem of serving 10k+ concurrent connections; solved by event loops, not threads.
Backpressure	Slowing a producer when the consumer can't keep up; for a proxy, stop reading one side when the other can't be written.
PROXY protocol	A header prepended to an L4 stream to convey the original client/destination addresses.
Nagle's algorithm	Coalesces small TCP writes; disabled with `TCP_NODELAY` for latency-sensitive traffic.
conntrack	The kernel connection-tracking table; a finite resource a busy gateway can exhaust.
HTTP/2 stream	One independent request/response multiplexed over a single TCP connection.
HPACK / QPACK	Header compression for HTTP/2 / HTTP/3 (QPACK is HOL-blocking-safe).
Flow control (h2)	Per-stream/connection credit windows; application-level backpressure via `WINDOW_UPDATE`.
GOAWAY	HTTP/2 frame to stop opening new streams; the graceful-drain signal.
Head-of-line blocking	A stalled item blocking those behind it: at L4 (TCP), h1 (pipelining), h2 (TCP under mux); fixed at h3.
QUIC / HTTP/3	Reliable multiplexed streams over UDP; no TCP HOL, 0-RTT, connection migration.
Filter chain	Zuul's inbound → endpoint → outbound phases; the programmable request lifecycle.
Connection churn	The rate of opening/closing connections to backends; the thing to minimize.
Connection pool / keep-alive	Reusing warm connections to avoid per-request TCP+TLS handshakes.
Subsetting	Each gateway connects to a subset of origins so total conns = `gateways × subset`, not `× origins`.
Van der Corput sequence	Binary low-discrepancy sequence (bit-reverse `i`); evenly-spread, stable subset selection.
WebSocket (RFC 6455)	HTTP-upgraded, full-duplex, framed channel; server can push at any time.
Push registry	`deviceId → owning node` map for routing a message to a connection (Pushy / KeyValue).
Reconnect storm	Many clients reconnecting at once; tamed with exponential backoff + full jitter.
P2C	Power of Two Choices: pick 2 random endpoints, send to the less-loaded; near-optimal, no coordination.
Outlier ejection	Removing a bad endpoint from rotation based on live error/latency.
Retry budget	Cap on retries as a fraction of total traffic; the anti-amplification rule.
Circuit breaker	CLOSED/OPEN/HALF-OPEN state machine that fails fast on a sick dependency.
Adaptive concurrency	Inferring the right in-flight limit from latency (Little's Law); shed at admission past the knee.
Metastable failure	A self-sustaining overload that persists after the trigger clears (e.g. a retry storm).
mTLS	Mutual TLS — both peers present certificates; cryptographic two-way authentication.
SPIFFE / SVID	A workload-identity URI / the short-lived X.509 cert carrying it (issued by SPIRE / Netflix Metatron).
SNI / ALPN	TLS extensions selecting the server cert by hostname / negotiating the protocol (h2/h3).
Zero-trust	Never trust the network; authenticate every connection, authorize every request.
Envoy	The canonical C++ L4/L7 data plane: listeners → filter chains → clusters → endpoints.
xDS	Envoy's discovery protocol (LDS/RDS/CDS/EDS/SDS) for dynamic config from a control plane.
ADS / Delta xDS	Aggregated (single ordered stream) / incremental variants of xDS.
ACK / NACK (xDS)	Envoy confirming or rejecting an applied config version; the rollout/correctness signal.
CNI	Container Network Interface — the spec + plugins that wire pod networking (veth, IPAM, overlay).
kube-proxy	Programs nodes to map ServiceIP → pod IP (iptables / IPVS; eBPF in kube-proxy replacements such as Cilium).
EndpointSlice	Sharded list of ready pod IPs behind a Service; the membership source for LB/EDS.
Readiness probe	Health check gating Service membership; the hook that makes graceful drain work.
CRD / Operator	A custom Kubernetes type / a controller that reconciles it to desired state.
Reconcile loop	Level-triggered, idempotent convergence of actual → desired state (self-healing).
Gateway API	Standard role-oriented K8s CRDs for L7 routing (GatewayClass/Gateway/HTTPRoute); successor to Ingress.
Finalizer	A marker blocking deletion until a controller cleans up external state.
RED / USE	Rate-Errors-Duration (request services) / Utilization-Saturation-Errors (resources).
Golden signals	Latency, traffic, errors, saturation — the four to dashboard and alert on.
SLI / SLO / error budget	Indicator / target / allowed failure that governs release velocity.
Trace context	Propagated identifiers (W3C `traceparent`, `b3`) that stitch spans into one trace across a proxy.
Shadow / mirror traffic	Copying prod traffic to a new path without serving its responses; zero-risk validation.
Canary / sticky canary	A small (consistent-cohort) slice of real traffic on the new path, compared to a control group.
NRI / OCI hooks	Container-runtime extension points to customize per-workload networking/storage (Netflix Titus→K8s).
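The Van der Corput entry ("bit-reverse `i`") translates almost directly into code. The sketch below only generates ring positions in [0, 1); how those positions are turned into an origin subset is left out, and the function name vanDerCorput is an assumption for this example.

```go
package main

import (
	"fmt"
	"math/bits"
)

// vanDerCorput returns the i-th element of the base-2 Van der Corput
// sequence in [0, 1), computed by reversing the 32 bits of i and reading
// the result as a binary fraction.
func vanDerCorput(i uint32) float64 {
	return float64(bits.Reverse32(i)) / (1 << 32)
}

func main() {
	// Each new point lands in the middle of one of the largest remaining
	// gaps, so any prefix of the sequence is spread evenly over the ring.
	for i := uint32(0); i < 8; i++ {
		fmt.Println(i, vanDerCorput(i))
	}
}
```

Because position i never changes as more members join, adding a gateway or origin leaves every existing placement where it was, which is what makes the selection stable.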

Platform & Distributed Systems Architecture (Phase 7)

Term	Definition
Bounded context	A cohesive business capability with its own model + boundary (DDD); the basis for a service boundary.
Service contract	The explicit, versioned interface (API + events + guarantees) a service exposes.
Distributed monolith	Services that must deploy together / share a DB / call synchronously per request — a monolith with added latency.
Cohesion / coupling	How related a service's responsibilities are / how dependent services are on each other.
Fan-in / fan-out	Number of services depending on X / number X depends on.
Blast radius	The set of services affected if a given service fails (transitive dependents).
Strangler fig	Incrementally replacing a system by routing slices to the new one.
Conway's Law	System structure mirrors org communication structure.
Backward / forward compatibility	New code reads old data / old code reads new data.
Tag number (protobuf)	A field's stable id; the basis of safe schema evolution (never reuse/retype).
Idempotency key	Client-supplied key letting a server dedupe retries of a non-idempotent op.
Opaque cursor	A tamper-evident pagination token that hides the storage scheme from the contract.
Event	An immutable fact that something happened ("OrderPlaced").
At-least-once / exactly-once	Delivery that may duplicate (the practical default) / a myth; use at-least-once + idempotency.
Dead-letter queue (DLQ)	Where un-processable messages go after exhausting retries.
Choreography / orchestration	Services reacting to events / a central coordinator (a saga).
Commit log	Append-only, ordered sequence of records addressed by offset (the streaming substrate).
Partition	One shard of a topic; the unit of ordering and parallelism.
Offset	A record's monotonic, stable position within its partition.
Consumer group	Consumers sharing a topic's partitions; each partition is read by at most one consumer in the group.
Rebalancing	Reassigning partitions to consumers on membership change (minimize movement).
Dual-write problem	Updating a DB and publishing an event non-atomically; a crash desyncs them.
Transactional outbox	Writing the event into the DB in the same transaction as the state change; a relay publishes it.
Change-data-capture (CDC)	Publishing events by tailing the DB's write-ahead log (the outbox alternative).
Saga	A sequence of local transactions with per-step compensating actions (vs distributed 2PC).
Compensation	A semantic undo of a completed saga step (refund, recall — not rollback).
Consistent hashing	Hash-ring key placement; a membership change moves ~1/N keys (vs mod-N's reshuffle).
Virtual nodes	Multiple ring positions per physical node; even load + smooth movement.
Quorum (N/W/R)	Replica count / write-acks / read-replicas.
R+W>N	The read/write-quorum overlap condition for strong (read-your-writes) consistency.
Linearizable / causal / eventual	Strongest → weakest consistency models.
CAP / PACELC	Consistency vs availability (under partition) / vs latency (else).
Infrastructure as Code (IaC)	Declarative infra + a diff-and-converge engine (Terraform/Pulumi).
Plan / apply / state / drift	The diff / converge / last-applied snapshot / world-diverged-from-state.
GitOps	Git as the single source of truth + a reconciler that continuously converges live state.
Sync / prune / self-heal	Apply git changes / delete what git dropped / revert manual drift.
Reconcile loop	Level-triggered, idempotent convergence to desired state (IaC/GitOps/operators/xDS).
SLI / SLO / error budget	Indicator / target / allowed-failure currency that governs release velocity.
Burn rate	errorRate / errorBudget; >1 = consuming the budget too fast.
Multi-window burn-rate alerting	Page only when long (sustained) AND short (ongoing) windows both burn fast.
Bulkhead	Per-dependency concurrency isolation so one saturated dependency can't starve others.
Cascading failure	One failure exhausting shared resources, toppling others.
Graceful degradation	Reduced-but-available service under stress.
ADR	Architecture Decision Record: context, decision, alternatives, consequences.
Fitness function	An automated test of an architectural property (cycles, layering, coupling) in CI.
Evolutionary architecture	Architecture as a continuously-tested, changeable property.
Paved road / golden path	The supported, easy default that makes the right thing the easy thing.
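
The quorum and burn-rate rows above reduce to one-line arithmetic. A minimal sketch, assuming the error budget is defined as 1 − SLO; the function names are illustrative and not taken from any lab:

```rust
/// True when every read quorum intersects every write quorum (R + W > N).
fn quorums_overlap(n: u32, w: u32, r: u32) -> bool {
    r + w > n
}

/// Burn rate = observed error rate / error budget, with budget = 1 - SLO (assumption).
fn burn_rate(error_rate: f64, slo: f64) -> f64 {
    error_rate / (1.0 - slo)
}

fn main() {
    // N=3, W=2, R=2: any read set shares at least one replica with any write set.
    println!("N=3 W=2 R=2 overlap: {}", quorums_overlap(3, 2, 2));
    // N=3, W=1, R=1: a read can miss the only replica that took the write.
    println!("N=3 W=1 R=1 overlap: {}", quorums_overlap(3, 1, 1));
    // 0.5% errors against a 99.9% SLO (0.1% budget) burns the budget about 5x too fast.
    println!("burn rate: {:.1}", burn_rate(0.005, 0.999));
}
```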

Toolchain Setup

All labs target Linux-first with macOS as a supported secondary platform. Windows is not supported (no io_uring, no O_DIRECT semantics we rely on; use WSL2 instead).

Required Versions

Tool	Minimum	Recommended	Why
Rust	1.78	1.82+	`std::io::IoSlice`, stabilized `OnceLock`, edition 2021 features used throughout
Go	1.22	1.23+	range-over-func iterators (stable in 1.23, hence the recommended version), improved `slices`/`maps` stdlib, generics maturity
C++	C++20	C++20 (Clang 16+ / GCC 13+)	Concepts, `<bit>` for endian ops, `std::span`, designated initializers
CMake	3.28	3.29+	`CMAKE_CXX_MODULES`, modern `target_link_libraries` semantics
clang-format	17	18+	Consistent C++ formatting across labs
Python	3.11	3.12+	Benchmark plotting & verification scripts (`matplotlib`, `pandas`)

Per-Language Setup

Rust

# rustup is the canonical installer.
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup default stable
rustup component add clippy rustfmt
cargo install cargo-nextest        # faster, parallel test runner
cargo install cargo-flamegraph     # used in db-22

Verify:

rustc --version    # rustc 1.78.0 or newer
cargo --version

Go

# macOS
brew install go
# Linux: download from https://go.dev/dl/ — distro packages are usually old.

# Useful tools
go install honnef.co/go/tools/cmd/staticcheck@latest
go install golang.org/x/perf/cmd/benchstat@latest

Verify:

go version    # go1.22 or newer

C++

# macOS
xcode-select --install
brew install cmake ninja llvm

# Linux (Debian/Ubuntu)
sudo apt-get install -y build-essential cmake ninja-build clang-17 clang-format-17 \
                        libsnappy-dev liburing-dev

Verify:

clang++ --version   # Clang 16 or newer
cmake --version     # 3.28 or newer

Optional but recommended:

liburing-dev — required only for db-21 (io_uring lab) on Linux.
libsnappy-dev — used in db-06 (SSTable block compression).
valgrind / lldb — for memory and crash debugging.

Per-Lab Build Commands

Every lab's src/<lang>/ directory is self-contained and supports these commands:

Language	Build	Test	Run
Rust	`cargo build --release`	`cargo nextest run` (or `cargo test`)	`cargo run --release --bin <name>`
Go	`go build ./...`	`go test ./...`	`go run ./cmd/<name>`
C++	`cmake -B build -G Ninja && cmake --build build`	`ctest --test-dir build`	`./build/<name>`

docs/execution.md in each lab repeats the exact commands with the lab-specific binary names.

OS-Specific Notes

Linux

io_uring requires kernel ≥ 5.1 (≥ 5.6 for most useful features). Check with uname -r.
O_DIRECT works on most filesystems but is rejected by tmpfs — use a real disk path in tests.
For accurate latency benchmarks, disable CPU frequency scaling: sudo cpupower frequency-set -g performance.

macOS

No io_uring — db-21 falls back to kqueue + worker pool. The lab explains the difference.
O_DIRECT does not exist; use F_NOCACHE via fcntl (the lab provides the wrapper).
fsync(2) does not guarantee data hits stable storage on macOS — use fcntl(F_FULLFSYNC). Labs handle this.

Editor / IDE

Any editor works. VS Code with these extensions is what the reference implementations were written in:

rust-lang.rust-analyzer
golang.go
llvm-vs-code-extensions.vscode-clangd
ms-vscode.cmake-tools

Sanity Check Script

Run this once after setup to verify everything works:

cd db-01-storage-primitives
( cd src/rust && cargo build --release ) && \
( cd src/go && go build ./... ) && \
( cd src/cpp && cmake -B build -G Ninja && cmake --build build ) && \
echo "All three toolchains OK."

Storage Primitives

The lab that earns you the right to talk about databases.

1. What Is It

This lab teaches the physical layer that every storage engine sits on top of: how data moves between a process's memory and a block device. You will learn the OS page model, the byte-order question (endianness), the four main file I/O styles (read/write, pread/pwrite, mmap, O_DIRECT), buffer alignment, and the durability primitive fsync. You'll also internalize the latency numbers for HDD, SATA SSD, and NVMe SSD that drive every storage design decision in the rest of the curriculum. The deliverable is a tiny page allocator plus a hexdump utility, written three times (once in Rust, once in Go, and once in C++), exercising pread/pwrite against a real disk file.

2. Why It Matters

Every later lab depends on these primitives. LSM-trees, B-trees, WALs — they're all built on pread/pwrite/fsync and an understanding of the page cache.
Choosing the right I/O style changes throughput by 10–100×. A naïve read loop is not the same as pread from many threads, which is not the same as io_uring, which is not the same as mmap.
Hardware shapes the algorithm. LSM-trees exist because random writes on HDDs were catastrophic. NVMe IOPS now make some classic assumptions wrong. Knowing the numbers prevents cargo-culting designs from the wrong decade.
fsync is the single most expensive syscall in any database. Understanding when it must be called — and when you can amortize it — is the difference between 100 commits/sec and 100,000 commits/sec.

3. How It Works

                  User process
   ┌───────────────────────────────────────────────────┐
   │   Your code: page_allocator, db.put("key", val)   │
   │           buffer = [u8; PAGE_SIZE]                │
   └────────────┬───────────────────┬──────────────────┘
                │                   │
                │  pread/pwrite     │  mmap
                │  (explicit copy)  │  (page-fault driven)
                ▼                   ▼
        ┌─────────────────────────────────────┐
        │       Kernel page cache (RAM)       │  ← cached pages,
        │   4 KiB pages, indexed by inode+off │     LRU-evicted
        └────────────────┬────────────────────┘
                         │  block layer
                         │  (scheduler, mq-deadline / none for NVMe)
                         ▼
                ┌─────────────────────┐
                │  Device driver      │  fsync() blocks here
                │  (NVMe / SATA AHCI) │  until disk acks
                └─────────┬───────────┘
                          ▼
                ┌─────────────────────┐
                │  Storage hardware   │  HDD:  ~5 ms  random
                │  HDD / SSD / NVMe   │  SSD:  ~100 µs random
                │                     │  NVMe: ~50  µs random
                └─────────────────────┘

Three things to internalize from this picture:

Without O_DIRECT, you always go through the kernel page cache. Your pread may hit a warm cache (memcpy speed) or cold cache (full disk I/O). Latency variance is enormous.
fsync (or fdatasync, or a file opened with O_SYNC/O_DSYNC) is how you ask the kernel to push dirty data to the device and have the device flush its own write cache. Without it, "the write returned" means "the kernel accepted it", not "it survives a power loss".
mmap and pread are fundamentally different mental models. mmap makes I/O implicit (page faults), pread makes it explicit (syscalls). LMDB chose mmap. SQLite, LevelDB, and PostgreSQL chose pread/pwrite. We will use pread/pwrite for predictability, and discuss mmap in the analysis.
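
To make the explicit-I/O model concrete, here is a minimal sketch of one page written and read back at a fixed offset. It assumes a Unix target and uses the standard library's `FileExt` trait, whose `write_all_at` and `read_exact_at` are built on `pwrite` and `pread`. The helper names `write_page` and `read_page` are illustrative, not the lab's reference API.

```rust
use std::fs::OpenOptions;
use std::io;
use std::os::unix::fs::FileExt; // positioned reads/writes (pread / pwrite on Unix)

const PAGE_SIZE: usize = 4096;

/// Write one zero-padded page at `page_no`, then ask the OS to make it durable.
fn write_page(path: &str, page_no: u64, payload: &[u8]) -> io::Result<()> {
    assert!(payload.len() <= PAGE_SIZE, "payload must fit in one page");
    let file = OpenOptions::new().read(true).write(true).create(true).open(path)?;
    let mut page = [0u8; PAGE_SIZE];
    page[..payload.len()].copy_from_slice(payload);
    // The offset is an argument; the file cursor is never touched.
    file.write_all_at(&page, page_no * PAGE_SIZE as u64)?;
    file.sync_data() // data-only sync (fdatasync on Linux)
}

/// Read one full page back from `page_no`.
fn read_page(path: &str, page_no: u64) -> io::Result<[u8; PAGE_SIZE]> {
    let file = OpenOptions::new().read(true).open(path)?;
    let mut page = [0u8; PAGE_SIZE];
    file.read_exact_at(&mut page, page_no * PAGE_SIZE as u64)?;
    Ok(page)
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("dse_positioned_io_demo.bin");
    let path = path.to_str().expect("temp path is valid UTF-8");
    write_page(path, 2, b"hello, disk")?;
    let page = read_page(path, 2)?;
    println!("{}", String::from_utf8_lossy(&page[..11]));
    Ok(())
}
```

Every cost in this sketch is a visible call: one positioned write, one sync, one positioned read. That visibility is the property the labs rely on.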

4. Core Terminology

Term	Definition
Page	Fixed-size unit of I/O between user and storage. The kernel page is 4 KiB on x86-64 Linux (16 KiB on Apple Silicon macOS); databases pick 4–32 KiB. We use 4 KiB.
Page cache	Kernel-managed RAM that mirrors recently accessed file pages. Transparent to `read`/`write` and `pread`/`pwrite`.
`pread(fd, buf, n, off)`	Read `n` bytes from `fd` starting at byte offset `off`. Does not affect the file pointer. Thread-safe.
`pwrite(fd, buf, n, off)`	Write `n` bytes to `fd` at byte offset `off`. Thread-safe.
`mmap`	Map a file region into the process's address space. Accesses become loads/stores; faults trigger page-ins.
`fsync(fd)`	Block until all dirty data and metadata for `fd` are on stable storage. The durability primitive.
`fdatasync(fd)`	Like `fsync` but may skip metadata updates that aren't required to retrieve the data.
`O_DIRECT`	Open flag (Linux) that bypasses the page cache. Requires 512-byte or 4-KiB alignment on buffers, offsets, sizes.
`F_FULLFSYNC`	macOS-only `fcntl` that actually flushes the drive's cache. `fsync` on macOS is not enough for true durability.
Endianness	Byte order of multi-byte integers in memory. Little-endian = LSB first (x86, ARM default); big-endian = MSB first (network byte order).
Alignment	An address being a multiple of N. Matters for SIMD, DMA, `O_DIRECT`, and many hardware operations.
Sector	The atomic write unit of the device. HDDs: 512 B (legacy) or 4 KiB (Advanced Format). NVMe: usually 4 KiB.
IOPS	I/O operations per second. The right unit for random workloads (HDD ~150, SATA SSD ~80K, NVMe ~500K–1M).
Latency	Time for one operation to complete. Often what users actually feel; throughput hides tail behavior.
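
Most of the page vocabulary above turns into integer arithmetic on byte offsets. A small sketch, assuming the lab's 4 KiB page size; the helper names are hypothetical:

```rust
const PAGE_SIZE: u64 = 4096;

/// Byte offset of the first byte of page `page_no`.
fn page_offset(page_no: u64) -> u64 {
    page_no * PAGE_SIZE
}

/// Split a byte offset into (page number, offset inside that page).
fn locate(byte_offset: u64) -> (u64, u64) {
    (byte_offset / PAGE_SIZE, byte_offset % PAGE_SIZE)
}

/// Number of pages needed to hold `len` bytes (rounds up).
fn pages_needed(len: u64) -> u64 {
    (len + PAGE_SIZE - 1) / PAGE_SIZE
}

fn main() {
    println!("page 3 starts at byte {}", page_offset(3));
    println!("byte 12300 lives at {:?}", locate(12_300));
    println!("100 MiB needs {} pages", pages_needed(100 * 1024 * 1024));
}
```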

5. Mental Models

The page cache is a transparent cache, not a database. Think of pread like Map::get: if the key is in the cache, it's a memcpy; if it's not, the kernel goes to disk for you. You can't observe a cache miss with timing alone in production — that's the whole point of caches and the whole reason benchmarks lie.

fsync is a phone call to the disk. All other writes are "I told the postman" — fast, no guarantee. fsync is "I waited on the line while the disk confirmed the package arrived." Phone calls are slow. Group commits = bundling 100 packages into one call.

mmap is "make the file look like an array". pread is "I will ask for bytes one request at a time". The first is convenient. The second is predictable. Convenience and predictability are usually at war.

Sequential vs random I/O differs by about 100× on an HDD and only 2–3× on NVMe. That gap is why LSM-trees won the 2000s, why "just append" was rediscovered in the 2010s, and why NVMe makes some of those assumptions less critical in the 2020s. Hardware shapes design.

6. Common Misconceptions

"write returning means my data is safe." False. The kernel buffered it. Only fsync (or fdatasync for data-only, or F_FULLFSYNC on macOS) guarantees durability.
"mmap is faster than pread because there's no syscall." Often false. mmap access generates page faults, which also trap into the kernel and are synchronous (you can't overlap them with computation as easily). LMDB-style designs win when the working set fits in RAM; they suffer on writes because durability requires an msync over the mapping.
"SSDs make random vs sequential irrelevant." Partially true. Random reads are fast, but random writes still incur garbage collection and write amplification at the firmware level. Sequential writes still reduce hardware WA significantly.
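"4 KiB is always the right page size." No. 4 KiB matches the OS page size, which is friendly for mmap and for the page cache, but engines pick different units for different jobs. LevelDB uses 4 KiB blocks (to limit read amplification) inside SSTables of about 2 MiB by default (RocksDB defaults to 64 MiB) so that writes stay sequential. PostgreSQL uses 8 KiB pages. The "right" page size depends on workload.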
"fsync flushes only my file." On many filesystems and many older kernels, fsync could flush more (or less) than expected. Modern ext4/xfs are sane, but historical PostgreSQL fsync bugs (2018) showed that the contract is more subtle than it looks.

7. Interview Talking Points

"For a write-heavy OLTP workload on local NVMe, I'd start with direct pwrite + fdatasync rather than mmap. mmap makes durable writes ambiguous: the kernel may write dirty mapped pages back at any time, so you give up control over write ordering, and msync(MS_SYNC) has to flush every dirty page in the range you hand it."
"My rule of thumb: HDD random read ≈ 5 ms, SATA SSD ≈ 100 µs, NVMe ≈ 50 µs, RAM ≈ 100 ns, L1 ≈ 1 ns. Each jump of two or three orders of magnitude is where a different design becomes interesting. LSM-trees collapse the gap between random and sequential by converting random writes to sequential ones."
"fsync cost is the difference between 100 commits/sec and 100,000 commits/sec, and group commit is what amortizes it. Group commit batches N concurrent transactions into one fsync, trading latency (one transaction may wait ~5 ms for a batch) for throughput (up to N× more committed transactions per fsync). Postgres and MySQL InnoDB both do this."
"O_DIRECT isn't a free win. You skip the page cache, so you have to implement your own cache and your buffers must be aligned. PostgreSQL deliberately uses the page cache and lets the OS do that work for it. Oracle and Sybase use O_DIRECT. The choice depends on whether you trust your buffer manager more than the kernel's."

8. Connections to Other Labs

db-02 — uses the page-aligned allocator from here for skip-list and hash-table node storage.
db-03 — the WAL is literally pwrite + fdatasync in a loop; this lab gives you the muscle memory.
db-06 — SSTable blocks are read via pread at known offsets; this lab is the read side.
db-11 — the SQLite pager is a pread/pwrite-based page cache; you'll reimplement what the kernel does for you here.
db-21 — revisits I/O with io_uring (Linux) and O_DIRECT for the advanced cases; this lab establishes the baseline.

References — Storage Primitives

Canonical Papers & Specifications

POSIX pread/pwrite/fsync — https://pubs.opengroup.org/onlinepubs/9699919799/functions/pread.html
Linux open(2) (for O_DIRECT, O_DSYNC) — https://man7.org/linux/man-pages/man2/open.2.html
Linux fsync(2) — https://man7.org/linux/man-pages/man2/fsync.2.html
Linux io_uring design — https://kernel.dk/io_uring.pdf (Jens Axboe, 2019). Read for db-21.
macOS F_FULLFSYNC — man fcntl on macOS; see also Apple Tech Note TN1150.

Hardware Numbers

"Latency Numbers Every Programmer Should Know" — Jeff Dean, 2012. https://gist.github.com/jboner/2841832
"What Every Programmer Should Know About Memory" — Ulrich Drepper, 2007. https://people.freebsd.org/~lstewart/articles/cpumemory.pdf (long but seminal)
NVMe specification — https://nvmexpress.org/specifications/ (skim §3 on queues, §4 on commands)

Battle Stories

"PostgreSQL's fsync surprise" — https://lwn.net/Articles/752063/. Why fsync semantics on Linux were subtler than database authors assumed. Read this.
"Files are Hard" — Dan Luu. https://danluu.com/file-consistency/. Survey of how filesystems can lose your data.
"mmap-based databases vs. read/write-based databases" — Andy Pavlo et al., "Are You Sure You Want to Use MMAP in Your Database Management System?", CIDR 2022. https://db.cs.cmu.edu/mmap-cidr2022/. Required reading if you ever consider mmap.

Implementation References

SQLite OS interface — https://www.sqlite.org/src/file/src/os_unix.c (search for unixSync to see real-world fsync handling, including the macOS F_FULLFSYNC workaround)
LevelDB env_posix.cc — https://github.com/google/leveldb/blob/main/util/env_posix.cc (look at PosixWritableFile::Sync)
LMDB — http://www.lmdb.tech/doc/ (the canonical mmap database; read for contrast)

Books

"Operating Systems: Three Easy Pieces" — Arpaci-Dusseau. Free at https://pages.cs.wisc.edu/~remzi/OSTEP/. Chapters 39–44 (persistence) are exactly this lab.
"Designing Data-Intensive Applications" — Martin Kleppmann, O'Reilly. Chapter 3 ("Storage and Retrieval") frames the LSM vs B-tree debate that drives Phases 2 and 3.

Analysis — Storage Primitives

This document is for the design decisions and the trade-offs. The CONCEPTS.md told you what exists; this tells you why we picked one over the other and what you'd reach for in different conditions.

Decision 1: `pread`/`pwrite` over `read`/`write`

We use pread/pwrite (explicit offsets) instead of read/write + lseek.

Aspect	`read`/`write` + `lseek`	`pread`/`pwrite`
Thread safety on shared `fd`	Unsafe — file pointer is shared, `lseek`+`read` races	Safe — offset is per-call
Syscalls per op	2 (lseek + read)	1
Mental model	Stateful cursor	Stateless random access
Used by	Single-threaded streaming code	All real databases (SQLite, LevelDB, Postgres)

Verdict: pread/pwrite is strictly better for database-style access patterns. The only reason to use the cursor variant is when you genuinely have a single sequential reader (e.g., tail -f).
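
The thread-safety row is easy to demonstrate. In this minimal sketch several threads read different pages through one shared descriptor. It assumes a Unix target (`FileExt::read_exact_at` is built on `pread`); `read_pages_concurrently` and `make_test_file` are illustrative names only.

```rust
use std::fs::File;
use std::io::{self, Write};
use std::os::unix::fs::FileExt;
use std::sync::Arc;
use std::thread;

const PAGE_SIZE: usize = 4096;

/// Read `pages` pages concurrently, one thread per page, through one shared descriptor.
/// Returns the first byte of each page, in page order.
fn read_pages_concurrently(file: Arc<File>, pages: u64) -> io::Result<Vec<u8>> {
    let handles: Vec<_> = (0..pages)
        .map(|page_no| {
            let file = Arc::clone(&file);
            thread::spawn(move || -> io::Result<u8> {
                let mut buf = [0u8; PAGE_SIZE];
                // The offset travels with the call, so no thread disturbs another.
                file.read_exact_at(&mut buf, page_no * PAGE_SIZE as u64)?;
                Ok(buf[0])
            })
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("reader thread panicked"))
        .collect()
}

/// Create a file in which every byte of page i equals i.
fn make_test_file(path: &std::path::Path, pages: u8) -> io::Result<Arc<File>> {
    let mut f = File::create(path)?;
    for i in 0..pages {
        f.write_all(&[i; PAGE_SIZE])?;
    }
    f.sync_all()?;
    Ok(Arc::new(File::open(path)?))
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("dse_shared_fd_demo.bin");
    let file = make_test_file(&path, 8)?;
    println!("{:?}", read_pages_concurrently(file, 8)?);
    Ok(())
}
```

With `lseek` + `read` the same program would need a lock around every seek/read pair, because the cursor is shared state on the descriptor.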

Decision 2: `pread`/`pwrite` over `mmap`

This is more nuanced. We use explicit I/O for all labs except where we deliberately study mmap.

Aspect	`mmap`	`pread`/`pwrite`
Code complexity	Lower (pointer access)	Higher (explicit calls)
Latency predictability	Bad — page faults are synchronous, can stall on cold pages	Good — every cost is visible in the syscall
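Write durability	Tricky: `msync(MS_SYNC)` flushes every dirty page in the range you pass, and the kernel may write pages back earlier in any order	Surgical: `fdatasync(fd)` after exactly the writes you issued
Memory accounting	File-backed pages live in the page cache and count toward RSS; hard to reason about WSS	Buffers are yours, you bound them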
Large files (> RAM)	Catastrophic — random page-in storms	Fine — you read what you need
Multi-threaded scaling	Page-fault locks scale poorly	Linear scaling with cores
TLB pressure	Hugepages help but transparent hugepage transitions can pause processes	None
Used by	LMDB, BoltDB	SQLite, LevelDB, RocksDB, Postgres

The Pavlo et al. CIDR 2022 paper (linked in references.md) is the definitive teardown. TL;DR: mmap is fine when (a) the dataset fits in RAM, (b) the workload is read-heavy, and (c) you don't care about latency tails. For everything else, pread/pwrite wins.

Decision 3: Page Size = 4 KiB

We pick 4 KiB as the default page size in this lab and reconsider in later labs.

Page size	Pros	Cons
512 B	Old HDD sector; small writes are cheap	Tiny metadata-to-data ratio, lots of indirection
4 KiB	Matches OS page, NVMe LBA, page cache. Sweet spot for OLTP.	Small for analytics
8 KiB	Postgres default. Better for slightly larger rows.	Wastes I/O for tiny tuples
16 KiB	MySQL InnoDB default. Good index fanout.	One row update = 16 KiB write
64 KiB / 1 MiB	Analytics, sequential scans, Parquet row groups	Terrible for random updates

Rule of thumb: page size ≈ the device sector size × small constant. With NVMe at 4 KiB LBA and the OS page also at 4 KiB, going smaller is fighting the hardware and going larger is amortizing a smaller win.

Decision 4: Endianness on Disk

Our on-disk format is little-endian. Justified by:

x86 and ARM (in normal mode) are little-endian. Big-endian on these platforms means a byte swap on every read.
Network protocols use big-endian by convention, but our on-disk format is not a network protocol — it's only read by the same machine (or by an explicit migration tool).
LevelDB and RocksDB use little-endian for fixed-width fields, with varints for variable-width. We follow that convention for compatibility of mental model.
SQLite uses big-endian for historical reasons (the format dates to 2000, when MIPS/PowerPC/SPARC were still common). It's a legitimate alternative; we just optimize for the modern hardware reality.

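Always use explicit conversion functions at the I/O boundary. Never memcpy an int to disk and hope. Our Rust code uses u64::to_le_bytes; Go uses encoding/binary.LittleEndian; C++20 uses std::endian from <bit> plus a byte swap (std::byteswap only arrives in C++23, so on C++20 write the swap by hand or use a compiler builtin such as __builtin_bswap64).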
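
As an illustration of converting at the boundary, the sketch below encodes and decodes a small header field by field. The 16-byte layout (magic, page number, payload length) is an assumption made for this example and is not the lab's actual header. Only the magic value is taken from this lab's verification checks.

```rust
/// ASCII "DSE1PAGE" read as a u64.
const MAGIC: u64 = 0x4453_4531_5041_4745;

/// A hypothetical 16-byte page header; the field set is illustrative only.
#[derive(Debug, Clone, Copy, PartialEq)]
struct PageHeader {
    magic: u64,
    page_no: u32,
    payload_len: u32,
}

fn read_u64_le(b: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(&b[..8]);
    u64::from_le_bytes(a)
}

fn read_u32_le(b: &[u8]) -> u32 {
    let mut a = [0u8; 4];
    a.copy_from_slice(&b[..4]);
    u32::from_le_bytes(a)
}

impl PageHeader {
    /// Serialize every field explicitly as little-endian, regardless of host byte order.
    fn encode(&self) -> [u8; 16] {
        let mut out = [0u8; 16];
        out[0..8].copy_from_slice(&self.magic.to_le_bytes());
        out[8..12].copy_from_slice(&self.page_no.to_le_bytes());
        out[12..16].copy_from_slice(&self.payload_len.to_le_bytes());
        out
    }

    /// Inverse of `encode`.
    fn decode(bytes: &[u8; 16]) -> PageHeader {
        PageHeader {
            magic: read_u64_le(&bytes[0..8]),
            page_no: read_u32_le(&bytes[8..12]),
            payload_len: read_u32_le(&bytes[12..16]),
        }
    }
}

fn main() {
    let header = PageHeader { magic: MAGIC, page_no: 3, payload_len: 11 };
    let bytes = header.encode();
    println!("on disk: {:02x?}", bytes);
    println!("round trip ok: {}", PageHeader::decode(&bytes) == header);
}
```

Because the conversion is explicit, the same bytes come out on a big-endian host, which is what makes cross-implementation files comparable.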

Decision 5: When to call `fsync`

The cost of fsync on consumer NVMe is roughly:

Single-threaded latency: ~50 µs–1 ms depending on outstanding writes
Throughput-limited: roughly 5,000–20,000 fsync/sec before contention dominates

The cost on a HDD is 5–15 ms per fsync. That's why ye olde databases did group commit.

The right policy depends on durability requirements:

Policy	What survives a crash	Throughput cost
No `fsync`	Nothing reliably (kernel may flush eventually)	None
`fsync` per write	Every acknowledged write	Massive: one device flush per write
`fsync` per N writes	Last (N-1) writes possibly lost	1/N the cost
Group commit	Every acknowledged write; latency = time-to-batch + `fsync`	Excellent — best of both
`fsync` periodically (e.g., 100 ms)	Last 100 ms of writes possibly lost (compare MySQL `innodb_flush_log_at_trx_commit=2`, which flushes the log about once per second)	Excellent

The right design for Phase 2's WAL is group commit: when a writer finishes, it waits on a condition variable; the WAL writer thread pwrites pending records, fdatasyncs once, then wakes every waiter. We'll build this in db-03.

macOS Caveat

On macOS, fsync(fd) does not flush the drive's write cache — it only sends the data to the drive. To get true durability you must call fcntl(fd, F_FULLFSYNC), which can be 10–100× slower than fsync on the same hardware. SQLite, LevelDB, and Postgres all handle this. Our wrapper in src/*/fsync_full.* does the platform dispatch.

Decision 6: `O_DIRECT` — Not in This Lab

We don't use O_DIRECT in Lab 01 because:

It requires aligned buffers (typically 4 KiB), aligned offsets, and aligned I/O sizes.
It bypasses the page cache, so you must implement your own — which is a buffer manager (Phase 3, db-11).
It's not available on macOS — you'd use fcntl(fd, F_NOCACHE, 1) as the closest analogue, but it has weaker semantics.

We revisit O_DIRECT in db-21-storage-engine-advanced once we have a buffer manager worth talking about.
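
The alignment requirement is the part that surprises people, so here is a minimal sketch of an aligned heap buffer built on `std::alloc`. It only shows how to obtain memory whose address is a multiple of 4 KiB and does not perform any `O_DIRECT` I/O. `AlignedBuf` is a hypothetical type, not something the labs provide.

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

/// A zero-filled heap buffer whose address is a multiple of `align`.
struct AlignedBuf {
    ptr: *mut u8,
    layout: Layout,
}

impl AlignedBuf {
    fn new(size: usize, align: usize) -> AlignedBuf {
        assert!(size > 0, "zero-sized buffers are not supported in this sketch");
        let layout = Layout::from_size_align(size, align).expect("align must be a power of two");
        // SAFETY: the layout has a non-zero size (checked above).
        let ptr = unsafe { alloc_zeroed(layout) };
        assert!(!ptr.is_null(), "allocation failed");
        AlignedBuf { ptr, layout }
    }

    fn as_mut_slice(&mut self) -> &mut [u8] {
        // SAFETY: ptr is valid for layout.size() bytes and uniquely owned by self.
        unsafe { std::slice::from_raw_parts_mut(self.ptr, self.layout.size()) }
    }

    fn addr(&self) -> usize {
        self.ptr as usize
    }
}

impl Drop for AlignedBuf {
    fn drop(&mut self) {
        // SAFETY: ptr was allocated with exactly this layout and is freed once.
        unsafe { dealloc(self.ptr, self.layout) }
    }
}

fn main() {
    let mut buf = AlignedBuf::new(4096, 4096);
    buf.as_mut_slice()[0] = 0xAB;
    println!("address is 4 KiB aligned: {}", buf.addr() % 4096 == 0);
}
```

A plain `Vec<u8>` gives no such guarantee, which is why direct I/O code usually carries its own buffer type.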

Hardware Numbers Cheat-Sheet

Memorize these. They drive every storage design decision:

L1 cache hit            1   ns
Branch mispredict       3   ns
L2 cache hit           ~4   ns
L3 cache hit          ~15   ns
DRAM access          ~100   ns       — 100× L1
Context switch       ~1–5  µs
NVMe random read      ~50  µs        — 500× DRAM
NVMe sequential read  ~5   µs/4KB
SATA SSD random read ~100  µs
SATA SSD seq read    ~10   µs/4KB
HDD random read       ~5   ms        — 50,000× DRAM
HDD sequential read  ~30   µs/4KB    (~150 MB/s)
fsync on NVMe       ~50 µs–1 ms
fsync on HDD          ~10 ms
F_FULLFSYNC (macOS)   ~10–50 ms      — actually flushes drive cache
Network RTT same DC   ~500 µs
Network RTT same region ~1 ms
Network RTT cross-region ~50–150 ms  — drives Raft heartbeat tuning in db-17

Gaps of several orders of magnitude are where the design changes. Between L1 and DRAM (100×), you can often ignore it. Between DRAM and disk (500× for NVMe, 50,000× for HDD), you can't. Between NVMe and a cross-region network round trip (1000× again), distributed systems get hard.

What Breaks at Scale

Filesystem journal contention: fsync on ext4 serializes through the FS journal. Many concurrent fsyncs on the same FS don't scale linearly. Mitigation: one WAL file per database, dedicated FS for WAL.
Page cache thrashing: when working set > RAM, every pread is a miss. The kernel's LRU is generic; an app-aware cache (Phase 2's block cache, Phase 3's pager) does better.
fsync failure handling: on Linux, a failed fsync can mark dirty pages as clean — silently losing your data. This is the "fsyncgate" referenced in the references. Mitigation: panic on fsync error and crash-recover from the WAL (modern Postgres does this).
NVMe queue depth: NVMe shines with QD=32–128 in flight. A single-threaded pread loop runs at QD=1 and leaves most of the drive idle. io_uring (Phase 5) fixes this.
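
The queue-depth point follows from simple arithmetic: with a fixed per-request latency, achievable IOPS is roughly the number of requests in flight divided by that latency. A back-of-the-envelope sketch, assuming latency does not grow with queue depth (real drives saturate at their rated IOPS):

```rust
/// Upper bound on IOPS for `queue_depth` requests in flight at a fixed latency.
fn iops(queue_depth: u32, latency_us: f64) -> f64 {
    queue_depth as f64 * 1_000_000.0 / latency_us
}

fn main() {
    // A synchronous pread loop keeps exactly one request in flight.
    println!("QD=1  at 50 us: {:.0} IOPS", iops(1, 50.0));
    // Keeping 32 requests in flight multiplies the ceiling by 32.
    println!("QD=32 at 50 us: {:.0} IOPS", iops(32, 50.0));
}
```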

Execution — Storage Primitives

Prerequisites

You've completed the toolchain setup in ../../TOOLS.md. To confirm:

rustc --version      # ≥ 1.78
go version           # ≥ 1.22
clang++ --version    # ≥ 16  (or g++ ≥ 13)
cmake --version      # ≥ 3.28

Quick Start — All Three Languages

From the lab root:

# Rust
( cd src/rust && cargo build --release )
./src/rust/target/release/pagealloc write /tmp/lab01.bin 0 "hello, disk"
./src/rust/target/release/pagealloc read  /tmp/lab01.bin 0
./src/rust/target/release/pagealloc hexdump /tmp/lab01.bin

# Go
( cd src/go && go build -o /tmp/pagealloc-go ./cmd/pagealloc )
/tmp/pagealloc-go write /tmp/lab01.bin 0 "hello, disk"
/tmp/pagealloc-go read  /tmp/lab01.bin 0
/tmp/pagealloc-go hexdump /tmp/lab01.bin

# C++
( cd src/cpp && cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release && cmake --build build )
./src/cpp/build/pagealloc write /tmp/lab01.bin 0 "hello, disk"
./src/cpp/build/pagealloc read  /tmp/lab01.bin 0
./src/cpp/build/pagealloc hexdump /tmp/lab01.bin

All three binaries are byte-compatible — write with one, read with another, get the same bytes.

CLI Reference (all three implementations)

Command	Effect
`pagealloc write <file> <page_no> <ascii_string>`	Write the ASCII bytes (zero-padded to 4 KiB) into page `page_no`. Calls `fsync` or `fdatasync` before returning.
`pagealloc read <file> <page_no>`	`pread` page `page_no` (4 KiB), print bytes up to first null.
`pagealloc hexdump <file>`	Walk the whole file 4 KiB at a time and print a canonical hex dump (16 bytes/line).
`pagealloc bench <file> <pages> <iters>`	Random `pread` benchmark: file is preallocated to `pages` pages, then `iters` random reads are timed.
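
For orientation, a hex-dump formatter with 16 bytes per line can be sketched in a few lines. The exact column layout used by the reference `pagealloc hexdump` is not specified here, so the format below (8-digit hex offset, hex bytes, ASCII gutter) is an assumption loosely modelled on common hex-dump tools.

```rust
/// Format `data` as hex-dump lines: 8-digit hex offset, 16 bytes per line, ASCII gutter.
fn hexdump(data: &[u8]) -> String {
    let mut out = String::new();
    for (i, chunk) in data.chunks(16).enumerate() {
        let hex: Vec<String> = chunk.iter().map(|b| format!("{:02x}", b)).collect();
        // Printable ASCII is shown as-is; everything else becomes '.'.
        let ascii: String = chunk
            .iter()
            .map(|&b| if b.is_ascii_graphic() || b == b' ' { b as char } else { '.' })
            .collect();
        // 16 bytes take 16*2 hex digits + 15 separators = 47 columns.
        out.push_str(&format!("{:08x}  {:<47}  |{}|\n", i * 16, hex.join(" "), ascii));
    }
    out
}

fn main() {
    let mut page = vec![0u8; 32];
    page[..11].copy_from_slice(b"hello, disk");
    print!("{}", hexdump(&page));
}
```

Whatever format you choose, all three implementations must emit it byte for byte.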

Tests

# Rust
( cd src/rust && cargo test )

# Go
( cd src/go && go test ./... )

# C++
( cd src/cpp && cmake --build build && ctest --test-dir build --output-on-failure )

Each test suite covers:

Round-trip: write then read returns the same bytes.
Cross-implementation: a file written by Rust must read correctly with the Go and C++ binaries (run by scripts/cross_test.sh).
fsync or fdatasync is called on write (verified by strace -e trace=fsync,fdatasync in the cross_test script on Linux).
Endianness sanity: the page header uses little-endian and is identical across implementations.

Environment Variables

Variable	Default	Effect
`DSE_PAGE_SIZE`	`4096`	Override page size (must be a power of two). Only consumed by the `pagealloc bench` subcommand.
`DSE_FSYNC`	`1`	If `0`, skip `fsync` on write. Only for benchmarking — never in production.

Observation — Storage Primitives

How to look inside the page cache, watch syscalls, measure latency, and prove to yourself that your code is doing what you think.

Looking at the Page Cache

Linux

# What's in the page cache for our file? (Requires `pcstat` or vmtouch.)
go install github.com/tobert/pcstat/pcstat@latest
pcstat /tmp/lab01.bin

# Drop the page cache (requires root) — to test "cold" reads.
sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

macOS

# `purge` drops the unified buffer cache (requires admin password).
sudo purge

# `fs_usage` is the macOS strace-for-files.
sudo fs_usage -w -f filesys pagealloc    # filters by process name; start it first, then run pagealloc in another terminal

Watching Syscalls

# Linux
strace -e trace=openat,pread64,pwrite64,fsync,fdatasync,close \
       /tmp/pagealloc-go write /tmp/lab01.bin 0 "hello"   # Go binary built in Quick Start

# macOS  (sudo required for dtrace)
sudo dtruss -f -t pread,pwrite,fsync ./src/cpp/build/pagealloc write /tmp/lab01.bin 0 "hello" 2>&1

You should see — in order:

openat(AT_FDCWD, "/tmp/lab01.bin", O_RDWR|O_CREAT, 0644) = 3
pwrite64(3, "hello\0\0...", 4096, 0)                    = 4096
fdatasync(3)                                            = 0
close(3)                                                = 0

If you see read(3, ...) or write(3, ...) without an offset, you're using the cursor-based calls instead of pread/pwrite, which is wrong for this lab. If you see no fsync/fdatasync, your durability is fake.

Measuring Latency

The bench subcommand measures cold-cache and warm-cache pread latency:

# Preallocate a 100 MB file, then do 10000 random 4 KiB reads.
./src/rust/target/release/pagealloc bench /tmp/lab01.bin 25600 10000

Expected output:

preallocated: 25600 pages = 102400 KiB
warm-cache reads:   p50=3.1 µs   p99=8.4 µs   throughput=315 MB/s
dropped page cache
cold-cache reads:   p50=78 µs    p99=210 µs   throughput=51 MB/s

The exact numbers depend on your hardware. The shape matters:

Warm p50 ≈ 1–5 µs: that's a memcpy from the page cache. No actual disk I/O.
Cold p50 ≈ 50–200 µs on NVMe, 5–15 ms on a spinning disk.
p99 well above p50 (about 3× in the sample above, and it can exceed 10× on a busy machine): latency tails are real; this motivates io_uring and dedicated I/O threads.
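
The p50/p99 figures are percentiles over per-read latency samples. A minimal sketch of that computation, using the nearest-rank definition (an assumption; the reference `bench` may use a different estimator) and a stand-in workload instead of real disk reads:

```rust
use std::time::Instant;

/// Nearest-rank percentile of an unsorted, non-empty sample set; `p` is 1..=100.
fn percentile(samples: &[u64], p: usize) -> u64 {
    assert!(!samples.is_empty() && p >= 1 && p <= 100);
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // rank = ceil(p/100 * n), computed in integers to avoid rounding surprises.
    let rank = (p * sorted.len() + 99) / 100;
    sorted[rank - 1]
}

fn main() {
    // Stand-in workload: summing a 4 KiB buffer instead of a real pread.
    let data = vec![1u8; 4096];
    let mut samples_ns = Vec::with_capacity(10_000);
    for _ in 0..10_000 {
        let start = Instant::now();
        let sum: u64 = data.iter().map(|&b| b as u64).sum();
        std::hint::black_box(sum);
        samples_ns.push(start.elapsed().as_nanos() as u64);
    }
    println!(
        "p50={} ns  p99={} ns",
        percentile(&samples_ns, 50),
        percentile(&samples_ns, 99)
    );
}
```

Report percentiles rather than the mean: a handful of slow reads barely moves the average but is exactly what p99 exposes.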

Profiling Tools

Rust

cargo install cargo-flamegraph
cd src/rust
cargo flamegraph --release --bin pagealloc -- bench /tmp/lab01.bin 25600 100000
# open flamegraph.svg in your browser

Go

cd src/go
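# -cpuprofile works with one package at a time; if ./... matches several, name the package that defines BenchmarkPread.
go test -bench=BenchmarkPread -cpuprofile=cpu.prof ./...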
go tool pprof -http=:8080 cpu.prof

C++

# Linux
perf record -F 999 -g ./src/cpp/build/pagealloc bench /tmp/lab01.bin 25600 100000
perf report

# macOS (use Instruments.app or sample)
sample pagealloc 5 -file /tmp/sample.txt

Watching Disk Throughput

# Linux  (iostat from sysstat package)
iostat -dx 1 nvme0n1

# macOS
sudo fs_usage -w -f diskio

While running pagealloc bench, watch r/s (reads per second), rkB/s, and await (avg I/O latency in ms). For NVMe at QD=1, expect r/s to plateau in the low tens of thousands (one read per ~50–100 µs); you'd need io_uring (db-21) to push it into the hundreds of thousands.

Verifying Endianness

# Create a fresh one-page file; its first four bytes will then be overwritten with the integer 0x01020304.
./src/rust/target/release/pagealloc write /tmp/endian.bin 0 ""
# In a REPL or a short program in any language, write 0x01020304 as a binary u32 at offset 0 of the file.
# Then dump the first line with xxd:
xxd /tmp/endian.bin | head -1

A little-endian system writes 04 03 02 01 for the value 0x01020304. If you see 01 02 03 04, either your machine is big-endian (unlikely on x86/ARM) or your code is using to_be_bytes somewhere.
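
If you want a ready-made way to perform the "write a binary u32" step, the following sketch does it with an explicit little-endian conversion. It assumes a Unix target for the positioned write, and `write_u32_le` is an illustrative helper, not part of `pagealloc`.

```rust
use std::fs::OpenOptions;
use std::io;
use std::os::unix::fs::FileExt;
use std::path::Path;

/// Overwrite the first four bytes of `path` with `value` in little-endian order.
/// Returns the bytes that were written.
fn write_u32_le(path: &Path, value: u32) -> io::Result<[u8; 4]> {
    let bytes = value.to_le_bytes();
    let file = OpenOptions::new().write(true).create(true).open(path)?;
    file.write_all_at(&bytes, 0)?;
    file.sync_data()?;
    Ok(bytes)
}

fn main() -> io::Result<()> {
    let bytes = write_u32_le(Path::new("/tmp/endian.bin"), 0x0102_0304)?;
    println!("wrote {:02x?}", bytes);
    Ok(())
}
```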

Verification — Storage Primitives

The pass/fail checks for this lab. If all eight pass for all three implementations, you are done.

Per-Implementation Checks

For each of src/rust, src/go, src/cpp:

V1 — Builds

# Rust
( cd src/rust && cargo build --release ) && echo "RUST OK"
# Go
( cd src/go && go build ./... ) && echo "GO OK"
# C++
( cd src/cpp && cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release && cmake --build build ) && echo "CPP OK"

V2 — Unit Tests Pass

( cd src/rust && cargo test --release ) && echo "RUST TESTS OK"
( cd src/go   && go test ./... )         && echo "GO TESTS OK"
( cd src/cpp/build && ctest --output-on-failure ) && echo "CPP TESTS OK"

V3 — Round-Trip

# Per binary:
$BIN write /tmp/v3.bin 5 "hello, lab"
$BIN read  /tmp/v3.bin 5 | grep -q "^hello, lab$" && echo "V3 OK"

V4 — `fsync` Is Called (Linux only)

strace -e fsync,fdatasync -o /tmp/syscalls.log $BIN write /tmp/v4.bin 0 "x"
grep -E 'fsync|fdatasync' /tmp/syscalls.log && echo "V4 OK"

(Expected: at least one of fsync(...) or fdatasync(...) in the trace. On macOS substitute sudo dtruss -t fsync.)

Cross-Implementation Checks

V5 — Byte-Compatibility

Files written by one implementation must read identically with the others.

RUST=./src/rust/target/release/pagealloc
GO=/tmp/pagealloc-go    # built in Quick Start via go build -o /tmp/pagealloc-go ./cmd/pagealloc
CPP=./src/cpp/build/pagealloc

$RUST write /tmp/v5.bin 3 "cross-lang ok"
$GO   read  /tmp/v5.bin 3 | grep -q "^cross-lang ok$" && echo "GO read RUST OK"
$CPP  read  /tmp/v5.bin 3 | grep -q "^cross-lang ok$" && echo "CPP read RUST OK"

$GO   write /tmp/v5.bin 7 "go writes"
$CPP  read  /tmp/v5.bin 7 | grep -q "^go writes$" && echo "CPP read GO OK"
$RUST read  /tmp/v5.bin 7 | grep -q "^go writes$" && echo "RUST read GO OK"

V6 — Hexdump Identical

$RUST hexdump /tmp/v5.bin > /tmp/v6.rust.hex
$GO   hexdump /tmp/v5.bin > /tmp/v6.go.hex
$CPP  hexdump /tmp/v5.bin > /tmp/v6.cpp.hex
diff /tmp/v6.rust.hex /tmp/v6.go.hex && diff /tmp/v6.rust.hex /tmp/v6.cpp.hex && echo "V6 OK"

V7 — Endianness Sanity

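The first 8 bytes of each non-empty page should be the magic constant 0x44534531_50414745 (ASCII "DSE1PAGE") stored little-endian, so the bytes on disk spell "EGAP1ESD". Page 0 of /tmp/v5.bin was never written in V5, so inspect page 3:

xxd -s $((3 * 4096)) -l 8 /tmp/v5.bin
# Expected: 00003000: 4547 4150 3145 5344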

If you see 4453 4531 5041 4745, your implementation is writing big-endian — fix that.

V8 — Benchmark Smoke

$RUST bench /tmp/v8.bin 1024 1000
# Expected: prints both warm-cache and cold-cache p50/p99 lines without crashing.

Master Script

A single command to run everything (provided as scripts/verify.sh):

bash scripts/verify.sh

Expected output ends with:

====================================================
ALL 8 CHECKS PASSED for RUST, GO, CPP
====================================================

If any check fails, the script exits non-zero and prints which check + which implementation failed.

Broader Ideas — Storage Primitives

Where to go after this lab if you want to push deeper. Each idea is a self-contained extension or alternative.

1. Replace `pread` with `io_uring` (Linux)

The single biggest jump from this lab's design to a modern engine is moving from synchronous syscalls to async submission queues. With pread at QD=1, NVMe runs at ~5% of its IOPS. With io_uring at QD=32+, it hits the spec sheet.

Lab pointer: db-21-storage-engine-advanced does this end-to-end.
Self-study: implement a pread_async API now that internally still uses pread but queues requests through a crossbeam channel (Rust) / goroutine pool (Go) / std::jthread worker pool (C++). When you then swap the backend for io_uring, no API consumer changes.
Reference: Jens Axboe's "Efficient IO with io_uring" (https://kernel.dk/io_uring.pdf), §3.

2. Page Layout — Slotted Pages vs Fixed-Size Records

Our pages are zero-padded ASCII. Real engines use slotted pages:

┌────────┬────────────────────────────┬──────┐
│ header │ slot[0] slot[1] ...        │ free │
│        │ → offsets into page        │      │
├────────┴────────────────────────────┴──────┤
│ ← record N ← record 1 ← record 0           │  (grows from end)
└────────────────────────────────────────────┘

This lets variable-length records share a page without external fragmentation. PostgreSQL, MySQL InnoDB, and SQLite all use slotted pages. Try this: extend pagealloc so each page holds a slot directory and stores up to 16 variable-length keys per page. This is the warm-up for db-10.

3. Copy-on-Write Pages (LMDB-style)

Instead of overwriting a page in place, allocate a fresh page and update the parent to point to it. This is how LMDB achieves single-writer MVCC without a WAL. Pros: simpler crash recovery (just point at the last committed root). Cons: requires garbage collection for unreferenced pages, and every update rewrites the full root-to-leaf path, which raises write amplification from one page per update to one page per tree level.

Reference: Howard Chu's LMDB tech docs, http://www.lmdb.tech/doc/
Self-study: extend the allocator to track free pages and never overwrite; introduce a "commit" op that just writes a new root pointer atomically.

4. Write Coalescing & Group Commit

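Right now every write calls fdatasync immediately. As soon as several writers are active, group commit lets them share one flush:

use std::{fs::File, io, os::unix::fs::FileExt, time::Duration};
use std::sync::mpsc::{Receiver, RecvTimeoutError, Sender};

// One queued write: the page bytes, its file offset, and a completion channel.
struct WriteReq { buf: Vec<u8>, offset: u64, done: Sender<()> }

// Batches writes and issues one fdatasync per batch. Flushes when 64 requests
// are pending or when no new request arrives within 100 µs.
fn group_commit_loop(file: &File, rx: Receiver<WriteReq>) -> io::Result<()> {
    let mut pending: Vec<WriteReq> = Vec::new();
    loop {
        match rx.recv_timeout(Duration::from_micros(100)) {
            Ok(req) => { pending.push(req); if pending.len() < 64 { continue; } }
            Err(RecvTimeoutError::Timeout) if pending.is_empty() => continue,
            Err(RecvTimeoutError::Disconnected) if pending.is_empty() => return Ok(()),
            Err(_) => {} // timeout or shutdown with writes pending: flush them
        }
        for req in &pending { file.write_all_at(&req.buf, req.offset)?; }
        file.sync_data()?; // one fdatasync covers the whole batch
        for req in pending.drain(..) { let _ = req.done.send(()); } // wake each writer
    }
}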

Lab pointer: db-03-write-ahead-log builds this for the WAL. Try it here as warm-up.
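Trade-off: each write can wait up to one batching window longer (100 µs here), while throughput under contention rises by roughly the average batch size (up to 64 here, so ~50× is plausible when batches fill).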

5. Direct I/O + Aligned Buffers

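O_DIRECT (Linux) or F_NOCACHE (macOS) bypasses the page cache. With O_DIRECT the buffer address, the file offset, and the transfer length must all be aligned to the device's logical block size; 4 KiB alignment is the safe choice. To get an aligned buffer: in Rust, build a Layout with Layout::from_size_align(4096, 4096)? and pass it to std::alloc::alloc; in C++, call posix_memalign(&buf, 4096, 4096); in Go it is trickier, so use golang.org/x/sys/unix.Mmap with MAP_ANON, which returns page-aligned memory.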

When this matters: when your app has a better cache than the kernel (e.g., Phase 2's block cache). Oracle, MySQL with O_DIRECT, and most flash-tuned engines pick this.
Self-study: add a pagealloc write-direct subcommand that opens with O_DIRECT and demonstrates the alignment requirement (the program must fail predictably if the buffer is unaligned).
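
For the Rust case, the aligned allocation could look roughly like the sketch below. AlignedBuf is a hypothetical helper invented for this illustration and is not part of the lab's code; it only shows how Layout and the global allocator combine to produce a 4-KiB-aligned, zeroed buffer. Opening the file with O_DIRECT is left out because it needs a platform-specific flag.

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

/// A heap buffer whose start address is a multiple of `align`.
/// Hypothetical helper for illustration; not part of the lab's code.
pub struct AlignedBuf {
    ptr: *mut u8,
    layout: Layout,
}

impl AlignedBuf {
    /// Allocates `size` zeroed bytes aligned to `align` (a power of two).
    /// Returns None for a zero size, an invalid layout, or allocation failure.
    pub fn new(size: usize, align: usize) -> Option<Self> {
        if size == 0 {
            return None; // the global allocator must not be asked for 0 bytes
        }
        let layout = Layout::from_size_align(size, align).ok()?;
        // SAFETY: `layout` has a non-zero size, checked above.
        let ptr = unsafe { alloc_zeroed(layout) };
        if ptr.is_null() {
            None
        } else {
            Some(Self { ptr, layout })
        }
    }

    pub fn as_slice(&self) -> &[u8] {
        // SAFETY: `ptr` points to `layout.size()` zero-initialized bytes we own.
        unsafe { std::slice::from_raw_parts(self.ptr, self.layout.size()) }
    }

    pub fn as_mut_slice(&mut self) -> &mut [u8] {
        // SAFETY: as above, and `&mut self` guarantees exclusive access.
        unsafe { std::slice::from_raw_parts_mut(self.ptr, self.layout.size()) }
    }
}

impl Drop for AlignedBuf {
    fn drop(&mut self) {
        // SAFETY: `ptr` came from `alloc_zeroed` with this exact layout.
        unsafe { dealloc(self.ptr, self.layout) };
    }
}
```

A buffer from AlignedBuf::new(4096, 4096) can be passed to write_all_at like any other slice; the alignment only starts to matter once the file is opened for direct I/O.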

6. Sparse Files & Hole Punching

Files don't have to be fully allocated on disk. fallocate(FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE) releases blocks back to the filesystem without changing the file size (the two flags must be passed together). Useful for LSM-tree SSTable compaction (free space after removing dead keys) and for journal log truncation.

Reference: man 2 fallocate
Self-study: add pagealloc punch <file> <page_no> and verify that the file's apparent size (ls -l <file> or du --apparent-size <file>) is unchanged while its on-disk size (du -h <file>) shrinks.

7. Crash Testing with `dm-flakey` (Linux)

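The hard part of storage code is testing the failure cases. dm-flakey is a Linux device-mapper target that makes a block device misbehave on a fixed schedule: it works normally for an up interval, then fails or drops I/O for a down interval, and repeats.

# Device size in 512-byte sectors, as the device-mapper table expects.
size=$(sudo blockdev --getsz /dev/loop0)
# 5-second window of normal operation, then 1 second of dropping writes, repeat.
sudo dmsetup create flakey-dev --table "0 $size flakey /dev/loop0 0 5 1 1 drop_writes"

Mount your test filesystem on /dev/mapper/flakey-dev and run your pagealloc write loop across the drop window. Writes that were fsynced during an up interval should survive a remount. Writes issued during the drop window are silently discarded even though they report success, which models a crash at the start of that window. This is how the real engines test durability.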

8. Comparing `mmap` Yourself

We argued for pread/pwrite. Don't take our word for it — implement pagealloc-mmap as a fourth implementation. Compare:

Workload	`pread`	`mmap`
Sequential read of 1 GB	?	?
Random read of 4 KiB × 10⁶ from a 1 GB file (warm)	?	?
Random read from a 100 GB file (cold)	?	?
10⁵ random writes with durability	?	?

Plot the results and write down what surprised you. Then compare those numbers with the mmap paper by Crotty, Leis, and Pavlo (in references.md) and check whether they match.

9. Persistent Memory (PMEM, Optane)

Intel Optane is dead, but Persistent Memory programming patterns survive in CXL.mem and in research kernels. PMEM is byte-addressable like RAM, persistent like SSD, with clwb + sfence as the durability primitive (no fsync). The Persistent Memory Development Kit (PMDK) is what to read.

Reference: https://pmem.io/pmdk/
Why it matters: if/when CXL persistent memory becomes commodity, every storage engine in this curriculum will need a rewrite. SplitFS and uTree are examples of research designs that already assume PMEM.

10. Beyond Disk: Object Storage as a Backing Store

Modern cloud-native databases (Snowflake, Databricks, BigQuery) don't pwrite to local disks: they PUT large immutable objects (typically megabytes each) to an object store such as S3. The trade-offs are wildly different: high per-request latency, effectively unlimited aggregate throughput, and, on S3 until December 2020, only eventual consistency for overwrites and listings. The closest "primitives lab" for that world would replace pread/pwrite with HTTP range requests. Worth thinking about, especially before db-23's capstone.

Reference: "Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics" (Armbrust et al., CIDR 2021)

Step 1 — Open a File and Write Bytes

Goal

Build the smallest possible thing that touches the disk: open a file, write some bytes at a known offset, close the file. You'll do this three times, once each in Rust, Go, and C++, so you can feel how each language exposes the same pread/pwrite/fsync primitives.

Prerequisites

Toolchain installed per ../../TOOLS.md.
An empty editor and a terminal in this lab's directory.

What You're Building

A function with this signature (conceptually):

write_page(path: string, page_no: u64, bytes: [u8]) -> Result

Opens (or creates) path for read+write.
Computes offset = page_no * PAGE_SIZE (with PAGE_SIZE = 4096).
Prepends the 16-byte page header (defined in Step 2) and zero-pads the result to exactly PAGE_SIZE.
pwrites the padded buffer at offset.
Calls fdatasync (or fsync if fdatasync is unavailable).
Closes the file.
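
The steps above can be sketched in Rust as follows. This is a minimal illustration under stated assumptions, not the lab's actual write_page: it leaves out the page header that Step 2 defines and writes only the zero-padded payload, and the error messages are made up for the example. It is Unix-only because it relies on std::os::unix::fs::FileExt.

```rust
use std::fs::OpenOptions;
use std::io;
use std::os::unix::fs::FileExt; // positional I/O (pwrite); Unix-only
use std::path::Path;

pub const PAGE_SIZE: usize = 4096;

/// Writes `bytes`, zero-padded to one page, at page `page_no` of `path`.
/// Sketch only: the page header introduced in Step 2 is left out.
pub fn write_page(path: &Path, page_no: u64, bytes: &[u8]) -> io::Result<()> {
    if bytes.len() > PAGE_SIZE {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "payload exceeds one page",
        ));
    }
    // 1. Open (or create) for read+write, without truncating existing pages.
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open(path)?;
    // 2. offset = page_no * PAGE_SIZE, rejecting arithmetic overflow.
    let offset = page_no.checked_mul(PAGE_SIZE as u64).ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidInput, "page number overflows u64")
    })?;
    // 3. Zero-pad to exactly one page.
    let mut page = [0u8; PAGE_SIZE];
    page[..bytes.len()].copy_from_slice(bytes);
    // 4. Positional write; write_all_at retries short writes.
    file.write_all_at(&page, offset)?;
    // 5. Flush file data to stable storage.
    file.sync_data()?;
    // 6. `file` is closed when it goes out of scope here.
    Ok(())
}
```

Writing page 1 of a new file leaves page 0 as a hole that reads back as zeros, which is why the offset arithmetic alone is enough to address any page.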

Why `pwrite`, not `write`

The classic POSIX write syscall writes at the file descriptor's current offset, the same one lseek moves. That makes it stateful: two threads calling write on the same fd race on that shared offset. pwrite takes an explicit offset and never touches the shared one, so concurrent calls don't interfere. Every database in this curriculum uses pwrite. No lseek in our code, ever.

Why `PAGE_SIZE = 4096`

It matches the default OS page size on x86_64 and on most ARM64 Linux systems (Apple Silicon uses 16 KiB pages, which 4096 divides evenly) and the 4 KiB physical sector of most modern drives, so the kernel page cache, the device block, and your write line up. Mismatched sizes cause read-modify-write at the kernel layer: writing 100 bytes to a page that is not cached requires the kernel to first read the 4 KiB page containing those bytes, modify it, and write it back. By always writing a full, aligned page, you avoid that hidden cost.

Why `fdatasync` Over `fsync`

fsync flushes data and all metadata (file size, modification time). fdatasync flushes the data plus only the metadata needed to read it back, such as a changed file size. For a write that doesn't change the file size, which is the common case in a steady-state database, fdatasync can skip the metadata flush, often saving a few hundred microseconds per call. Use fdatasync when you can.

Rust Implementation

In ../src/rust/src/lib.rs we use the std::os::unix::fs::FileExt extension trait. Its write_at method is a thin wrapper over pwrite(2) (pwrite64 on Linux), and write_all_at loops over it until the whole buffer is written. Look at the function write_page.

Key idiom:

use std::os::unix::fs::FileExt;
file.write_all_at(&buf, offset)?;   // pwrite, retried until all of buf is written
file.sync_data()?;                  // fdatasync on Linux

sync_data is Rust's portable name for "flush file data": it calls fdatasync on Linux and fcntl(F_FULLFSYNC) on macOS. macOS also offers F_BARRIERFSYNC, a cheaper write barrier, but check the standard library source for your toolchain before assuming sync_data uses it.

Go Implementation

In ../src/go/pagealloc.go, the WriteAt method is pwrite, and f.Sync() is fsync. There is no first-class fdatasync in os, so we call unix.Fdatasync(fd) from golang.org/x/sys/unix.

if _, err := f.WriteAt(buf, offset); err != nil { return err }
return unix.Fdatasync(int(f.Fd()))

On macOS, unix.Fdatasync is not exported (the kernel doesn't have it). We fall back to unix.FcntlInt(fd, unix.F_FULLFSYNC, 0). The wrapper in fsync_full.go handles the platform branch.

C++ Implementation

In ../src/cpp/src/pagealloc.cc:

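ssize_t n = ::pwrite(fd, buf.data(), buf.size(), offset);
if (n != static_cast<ssize_t>(buf.size())) return std::errc::io_error;  // error or short write
if (::fdatasync(fd) != 0) return std::errc::io_error;                   // a failed flush means the page is not durable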

On macOS we use ::fcntl(fd, F_FULLFSYNC). The dispatch is in fsync_full.cc.

Try It

cd src/rust && cargo build --release
./target/release/pagealloc write /tmp/step1.bin 0 "first page"
xxd -l 32 /tmp/step1.bin

Expected output:

00000000: 4547 4150 3145 5344 0100 0000 0a00 0000  EGAP1ESD........
00000010: 6669 7273 7420 7061 6765 0000 0000 0000  first page......

The first 8 bytes are our little-endian page magic 0x44534531_50414745 (read as bytes left-to-right: 45 47 41 50 31 45 53 44). Bytes 8–15 hold the rest of the 16-byte header that Step 2 defines: version 1, flags 0, and payload_len 10 (0a 00 00 00). Bytes 16+ contain your ASCII payload "first page" followed by zero-padding to 4 KiB.

What Just Happened

You opened a file (open(2) with O_RDWR | O_CREAT).
You wrote exactly one page at exactly one offset (pwrite(2)).
You forced the data to stable storage (fdatasync(2) or F_FULLFSYNC on macOS).
You closed the fd, which does not flush — close(2) returns immediately.

On a power loss between step 3 and step 4, your write survives. Without step 3, it might not.

In Step 2 you'll add the read path and a hexdump utility, and verify that all three implementations produce byte-identical files.

Step 2 — `pread` and Hexdump

Goal

Implement the read side (pread) and a hexdump utility, then prove cross-implementation byte-compatibility: a file written by Rust must read identically from Go and C++.

The Read Side

Symmetric to Step 1:

read_page(path: string, page_no: u64) -> [u8; PAGE_SIZE]

Open the file read-only.
pread(fd, buf, PAGE_SIZE, page_no * PAGE_SIZE).
Return the buffer (caller will strip trailing zeros or use the magic header to validate).

Rust

use std::os::unix::fs::FileExt;
file.read_exact_at(&mut buf, offset)?;   // pread until buf is full; a short read is an UnexpectedEof error

Go

n, err := f.ReadAt(buf, offset)
if err != nil && err != io.EOF { return nil, err }

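ReadAt returns a non-nil error whenever n < len(buf); at the end of the file that error is io.EOF. This is normal for the last page of a file that hasn't been preallocated, and the unread tail of buf keeps whatever it held before (zeros for a freshly allocated slice). Tests handle this case.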

C++

ssize_t n = ::pread(fd, buf.data(), buf.size(), offset);
if (n < 0) return std::errc::io_error;
buf.resize(n);   // shrink if short read

Every non-empty page in our format begins with a 16-byte header:

offset  size  field
------  ----  -----
   0     8    magic = 0x44534531_50414745  (LE: 45 47 41 50 31 45 53 44 ; ASCII reversed: "EGAP1ESD")
   8     2    version = 1
  10     2    flags = 0
  12     4    payload_len  (LE u32, number of bytes used after the header)
  16     n    payload bytes
n+16     —    zero-pad to PAGE_SIZE

This is a deliberately simple format — we'll grow it in later labs. For now it gives us:

A magic number to detect "is this a valid page?"
A version field to evolve the format later.
An explicit payload length so we don't have to scan for zeros (zeros are valid bytes in real data).
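
The header layout above could be encoded and validated along these lines. This is a sketch under assumptions: the function names encode_page and decode_page and the constant names are invented for the example, and version and flags are assumed to be little-endian like the other fields.

```rust
use std::convert::TryInto;

pub const PAGE_SIZE: usize = 4096;
pub const HEADER_LEN: usize = 16;
pub const MAGIC: u64 = 0x4453_4531_5041_4745;
pub const VERSION: u16 = 1;

/// Builds one full page: 16-byte header, payload, zero padding.
/// Returns None if the payload does not fit after the header.
pub fn encode_page(payload: &[u8]) -> Option<[u8; PAGE_SIZE]> {
    if payload.len() > PAGE_SIZE - HEADER_LEN {
        return None;
    }
    let mut page = [0u8; PAGE_SIZE];
    page[0..8].copy_from_slice(&MAGIC.to_le_bytes());
    page[8..10].copy_from_slice(&VERSION.to_le_bytes());
    page[10..12].copy_from_slice(&0u16.to_le_bytes()); // flags = 0
    page[12..16].copy_from_slice(&(payload.len() as u32).to_le_bytes());
    page[HEADER_LEN..HEADER_LEN + payload.len()].copy_from_slice(payload);
    Some(page)
}

/// Validates the header and returns the payload slice, or None if the
/// magic, version, or payload length is not what this format expects.
pub fn decode_page(page: &[u8; PAGE_SIZE]) -> Option<&[u8]> {
    let magic = u64::from_le_bytes(page[0..8].try_into().ok()?);
    let version = u16::from_le_bytes(page[8..10].try_into().ok()?);
    let len = u32::from_le_bytes(page[12..16].try_into().ok()?) as usize;
    if magic != MAGIC || version != VERSION || len > PAGE_SIZE - HEADER_LEN {
        return None;
    }
    Some(&page[HEADER_LEN..HEADER_LEN + len])
}
```

Because payload_len is explicit, decode_page returns exactly the bytes that were stored, including any trailing zeros that belong to the payload. An all-zero (never written) page fails the magic check and decodes to None.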

The Hexdump Utility

A canonical 16-byte-per-line hex dump:

00000000: 4547 4150 3145 5344 0100 0000 0a00 0000  EGAP1ESD........
00000010: 6669 7273 7420 7061 6765 0000 0000 0000  first page......
00000020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
...

Format spec:

8-digit hex offset, then : .
16 bytes per line, grouped 2 bytes per word, separated by single space.
2-space gap before the ASCII rendering.
ASCII rendering: printable ASCII as itself, otherwise ..

This format matches xxd output for easy diff-based cross-language verification.
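
One way to produce lines in this format is sketched below. The function names hexdump_line and hexdump are invented for the example. Padding short final lines to a 40-character hex column is an assumption about xxd's alignment; pages are multiples of 16 bytes, so that path is not exercised by this lab's files.

```rust
/// Formats up to 16 bytes as one xxd-style line (no trailing newline).
pub fn hexdump_line(offset: usize, chunk: &[u8]) -> String {
    assert!(chunk.len() <= 16, "one line holds at most 16 bytes");
    let mut hex = String::with_capacity(40);
    for (i, byte) in chunk.iter().enumerate() {
        if i % 2 == 0 {
            hex.push(' '); // one space before every 2-byte word
        }
        hex.push_str(&format!("{:02x}", byte));
    }
    // Printable ASCII (0x20..=0x7e) is shown as itself, everything else as '.'.
    let ascii: String = chunk
        .iter()
        .map(|&b| if (0x20..=0x7e).contains(&b) { b as char } else { '.' })
        .collect();
    // Pad the hex column to 40 characters so the ASCII column lines up.
    format!("{:08x}:{:<40}  {}", offset, hex, ascii)
}

/// Dumps a whole buffer, 16 bytes per line, each line newline-terminated.
pub fn hexdump(data: &[u8]) -> String {
    data.chunks(16)
        .enumerate()
        .map(|(i, chunk)| hexdump_line(i * 16, chunk) + "\n")
        .collect()
}
```

Keeping the line formatter separate from the file loop makes it easy to unit-test against the sample lines shown above before running the cross-language diff.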

Cross-Implementation Test

This is the most important check in this lab. Run scripts/cross_test.sh:

bash scripts/cross_test.sh

What it does (excerpt):

$RUST write /tmp/xt.bin 0 "from rust"
$GO   write /tmp/xt.bin 1 "from go"
$CPP  write /tmp/xt.bin 2 "from cpp"

$RUST hexdump /tmp/xt.bin > /tmp/h.rust
$GO   hexdump /tmp/xt.bin > /tmp/h.go
$CPP  hexdump /tmp/xt.bin > /tmp/h.cpp

diff /tmp/h.rust /tmp/h.go || { echo "RUST/GO mismatch"; exit 1; }
diff /tmp/h.rust /tmp/h.cpp || { echo "RUST/CPP mismatch"; exit 1; }
echo "cross-language byte-compat OK"

If this fails, the most common bugs are:

Wrong endianness on the magic or payload_len.
Forgetting to zero-pad the page (one impl leaves junk past the payload).
Off-by-one on the offset calculation (page_no * PAGE_SIZE vs (page_no + 1) * PAGE_SIZE).

What Just Happened

You now have a portable, file-format-compatible storage primitive across three languages. This is the foundation for every later lab — the WAL in db-03 is exactly this with append-only semantics, and the SSTable in db-06 is this with a richer block format.

In Step 3 you'll measure latency, demonstrate the page cache, and understand why your second read of a page is roughly 50× faster than the first.

Step 3 — Benchmark and the Page Cache

Goal

See the page cache with your own eyes by measuring warm-cache and cold-cache pread latency. This is the experiment that should make you suspicious of every microbenchmark you ever read.

The Benchmark

pagealloc bench <file> <pages> <iters>:

Preallocate file to pages * 4 KiB using a sequential write loop.
fsync to make sure it's on disk.
Time iters random preads of one page each.
Drop the page cache.
Time iters random preads again.
Print p50/p99/p999 for each phase plus throughput in MB/s.
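
The reporting step could be computed as in the sketch below. It assumes the nearest-rank percentile definition and MB = 10^6 bytes; the real bench subcommand may use different conventions. The function names are invented for the example, and percentiles are passed in per-mille (500 = p50, 990 = p99, 999 = p99.9) to keep the arithmetic in integers.

```rust
use std::time::Duration;

/// Nearest-rank percentile of latencies that are already sorted ascending.
/// `permille` is the percentile times ten: 500 = p50, 990 = p99, 999 = p99.9.
pub fn percentile(sorted: &[Duration], permille: usize) -> Option<Duration> {
    if sorted.is_empty() || permille == 0 || permille > 1000 {
        return None;
    }
    // rank = ceil(permille / 1000 * n), always in 1..=n.
    let rank = (permille * sorted.len() + 999) / 1000;
    Some(sorted[rank - 1])
}

/// Throughput in MB/s (10^6 bytes per second) for `ops` reads of
/// `bytes_per_op` bytes each that took `elapsed` in total.
pub fn throughput_mb_s(ops: usize, bytes_per_op: usize, elapsed: Duration) -> f64 {
    (ops * bytes_per_op) as f64 / 1e6 / elapsed.as_secs_f64()
}
```

Collect one Duration per pread, sort the vector once per phase, then query all three percentiles from the same sorted slice.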

Implementation lives in:

../src/rust/src/bin/pagealloc.rs (bench subcommand)
../src/go/cmd/pagealloc/main.go (bench subcommand)
../src/cpp/src/main.cc (bench subcommand)

Dropping the Page Cache

On Linux:

sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

Our benchmark binary calls this automatically if it can. If it can't (no sudo), it prints a warning and skips the cold phase.

On macOS:

sudo purge

Same logic — the binary attempts it, warns if it can't.

Expected Numbers

On a modern laptop with NVMe:

$ ./pagealloc bench /tmp/bench.bin 65536 50000   # 256 MB file, 50k iters
preallocated 65536 pages = 262144 KiB

WARM cache:
  iterations : 50000
  p50        : 1.2 µs
  p99        : 5.8 µs
  p99.9      : 18 µs
  throughput : 1840 MB/s

dropped page cache

COLD cache:
  iterations : 50000
  p50        : 64 µs
  p99        : 180 µs
  p99.9      : 340 µs
  throughput : 56 MB/s

Two observations:

Warm cache is ~50× faster than cold. The page cache makes microbenchmarks lie. If you benchmarked a database after running the benchmark for warmup, you measured memcpy, not disk.
p99 is ~5× p50 on warm cache and still ~3× p50 on cold cache. Latency tails come from queue depth, kernel scheduling, and NVMe garbage collection. This motivates io_uring (Lab 21) and request hedging in distributed systems (Lab 20).

On a Spinning Disk (if you have one)

COLD cache:
  p50        : 6.4 ms     ← 100× worse than NVMe
  p99        : 18 ms
  throughput : 0.6 MB/s   ← versus 56 MB/s on NVMe

This 100× gap is why LSM-trees exist. Random reads on HDD are unworkable for OLTP; engines either:

Avoid them (sequential append-only logs).
Hide them behind cache (large block caches + bloom filters).
Punt to SSD (HDD as cold tier only).

Throughput vs Latency

Watch what happens with iters = 1000 vs iters = 100000:

$ ./pagealloc bench /tmp/bench.bin 65536 1000
WARM throughput : 4200 MB/s

$ ./pagealloc bench /tmp/bench.bin 65536 100000
WARM throughput : 1800 MB/s

Higher iteration counts touch more distinct pages (1,000 random pages cover about 4 MB, while 100,000 draws cover most of the 256 MB file), so the copy out of the page cache increasingly misses the CPU caches and the TLB and runs at memory bandwidth. Real workloads sit between these. A single benchmark number is almost always wrong.

Try This

Add a flag to control the access pattern: sequential vs random. Sequential preads benefit from the kernel's read-ahead heuristic. On the same NVMe device you should see:

RANDOM     cold : 56 MB/s
SEQUENTIAL cold : 2400 MB/s    ← 40× faster, all due to read-ahead

This is why scans are cheap and point lookups are expensive — even on SSD.
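
A starting point for the access-pattern flag is sketched below: it generates the list of byte offsets to pread for either pattern. The Pattern enum and page_offsets function are hypothetical names, and the xorshift64 generator with a plain modulo is a simplification (it carries a small modulo bias that does not matter for a smoke benchmark).

```rust
pub const PAGE_SIZE: u64 = 4096;

#[derive(Clone, Copy)]
pub enum Pattern {
    Sequential,
    Random,
}

/// Returns `iters` page-aligned byte offsets inside a file of `pages` pages.
pub fn page_offsets(pattern: Pattern, pages: u64, iters: usize, seed: u64) -> Vec<u64> {
    assert!(pages > 0, "file must contain at least one page");
    let mut state = seed | 1; // xorshift64 state must be non-zero
    let mut offsets = Vec::with_capacity(iters);
    for i in 0..iters as u64 {
        let page = match pattern {
            // Walk the file front to back, wrapping at the end.
            Pattern::Sequential => i % pages,
            // One xorshift64 step, reduced to a page number.
            Pattern::Random => {
                state ^= state << 13;
                state ^= state >> 7;
                state ^= state << 17;
                state % pages
            }
        };
        offsets.push(page * PAGE_SIZE);
    }
    offsets
}
```

Generating the offsets before the timed loop keeps the random-number cost out of the measured latency, and a fixed seed makes two runs touch the same pages.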

What Just Happened

You measured the page cache, the access pattern's effect on throughput, and the gap between p50 and p99. These three insights drive every storage design in this curriculum:

Page cache exists → your in-process block cache (Lab 8) must be smarter than LRU on raw bytes, otherwise you're duplicating the kernel's work.
Sequential >> random → LSM-tree compaction (Lab 7) sorts data on disk to convert all future reads to sequential ranges.
p99 >> p50 → consensus heartbeats (Lab 17) must tolerate occasional 100× slow fsyncs without triggering elections.

You've finished Lab 01. Run the checks in docs/verification.md to confirm all 8 pass. Then move on to db-02-data-structures-for-storage.

Data Structures for Storage

Status: Complete. Companion to db-01 and prerequisite for db-05 (MemTable) and db-10 (B-Tree).

1. What Is It

This lab is about the in-memory data structures that databases use, and why those choices change completely when the data is on disk. We build two structures from scratch:

A skip list — an ordered, probabilistic, pointer-based map. This is what LevelDB and RocksDB use for their MemTable, and what Redis uses for sorted sets.
A hash table with open-addressing + Robin Hood probing — an unordered, array-backed map. This is what you use when you need O(1) point lookups and don't need ordering.

We then benchmark them against each other on three workloads (insert, point lookup, range scan) and explain why the numbers come out the way they do.

This lab does not implement a B-Tree or a B+-Tree — those are on-disk structures and arrive in db-10. The lesson here is why a B-Tree dominates on disk even though a skip list or hash table is faster in RAM.

2. Why It Matters

Every database has a critical-path data structure for "find this key":

System	Structure	Why
LevelDB / RocksDB MemTable	Skip list	Lock-free reads, ordered iteration for flush to SSTable
Redis sorted-set (ZSET)	Skip list + hash table	Skip list for ranked access, hash for O(1) by-key
Memcached, Java `HashMap`	Hash table (separate chaining)	Unordered, point lookup only
SQLite, PostgreSQL, InnoDB	B+-Tree	On-disk: minimize page reads
Cassandra, ScyllaDB MemTable	Skip list	Same reasoning as LevelDB
Lucene postings	Skip list + delta-encoded arrays	Range scans over sorted doc IDs

If you pick the wrong structure you don't get "a little slower" — you get "100× slower" or "we run out of memory at 100M keys." The cost model for an in-memory structure (cycles, cache misses) is not the cost model for an on-disk one (page reads, sync latency), and a structure tuned for one will lose badly in the other domain.

3. How It Works

Skip list

A skip list is a stack of linked lists. The bottom list contains every key in sorted order. Each higher list contains a random subset of the keys below it, sampled with probability p (we use p = 0.5). To find a key, you walk right on the highest level until you'd overshoot, then drop down a level, repeat.

level 3:  HEAD ────────────────────────────────────────────────►  NIL
              │                                                  │
level 2:  HEAD ────────► [13] ───────────────► [42] ────────────► NIL
              │           │                     │                │
level 1:  HEAD ──► [7] ─► [13] ─► [21] ───────► [42] ─► [55] ──► NIL
              │     │     │       │             │       │       │
level 0:  HEAD ► [3]►[7]►[13]►[21]►[34]►[39]►[42]►[51]►[55] ──► NIL

Searching for 39 from the top:

L3: HEAD → NIL (overshoot from HEAD), drop to L2
L2: HEAD → 13 → 42? 42 > 39, drop to L1
L1: 13 → 21 → 42? 42 > 39, drop to L0
L0: 21 → 34 → 39. Found.

Expected search cost: O(log n), about (1/p) · log_{1/p}(n) comparisons. With p = 0.5 that's 2 · log₂ n.
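
The search walk can be sketched as a small arena-backed skip list. This is an illustration under assumptions rather than the lab's implementation: keys and values are plain u64, nodes live in a Vec and point to each other by index, and the xorshift64 generator and all names are chosen for the example.

```rust
const MAX_LEVEL: usize = 32;
const NIL: usize = usize::MAX; // "no successor" marker

struct Node {
    key: u64,
    value: u64,
    next: Vec<usize>, // next[lvl] = arena index of the successor on that level
}

/// Arena-backed skip list; nodes[0] is the sentinel head.
pub struct SkipList {
    nodes: Vec<Node>,
    level: usize, // number of levels currently in use
    rng: u64,     // xorshift64 state, never zero
}

impl SkipList {
    pub fn new(seed: u64) -> Self {
        let head = Node { key: 0, value: 0, next: vec![NIL; MAX_LEVEL] };
        SkipList { nodes: vec![head], level: 1, rng: seed | 1 }
    }

    /// p = 0.5: every trailing zero bit of a random word is one promotion.
    fn random_height(&mut self) -> usize {
        self.rng ^= self.rng << 13;
        self.rng ^= self.rng >> 7;
        self.rng ^= self.rng << 17;
        (self.rng.trailing_zeros() as usize + 1).min(MAX_LEVEL)
    }

    /// Per level, the last node whose key is < `key` (the head if none).
    fn predecessors(&self, key: u64) -> [usize; MAX_LEVEL] {
        let mut preds = [0usize; MAX_LEVEL];
        let mut cur = 0;
        for lvl in (0..self.level).rev() {
            // Walk right until the next key would reach or overshoot the target.
            loop {
                let nxt = self.nodes[cur].next[lvl];
                if nxt == NIL || self.nodes[nxt].key >= key {
                    break;
                }
                cur = nxt;
            }
            preds[lvl] = cur; // then drop down one level
        }
        preds
    }

    pub fn get(&self, key: u64) -> Option<u64> {
        let cand = self.nodes[self.predecessors(key)[0]].next[0];
        (cand != NIL && self.nodes[cand].key == key).then(|| self.nodes[cand].value)
    }

    pub fn insert(&mut self, key: u64, value: u64) {
        let preds = self.predecessors(key);
        let cand = self.nodes[preds[0]].next[0];
        if cand != NIL && self.nodes[cand].key == key {
            self.nodes[cand].value = value; // overwrite an existing key
            return;
        }
        let height = self.random_height();
        self.level = self.level.max(height); // unused levels already point head -> NIL
        let idx = self.nodes.len();
        let next = (0..height).map(|lvl| self.nodes[preds[lvl]].next[lvl]).collect();
        self.nodes.push(Node { key, value, next });
        for lvl in 0..height {
            self.nodes[preds[lvl]].next[lvl] = idx; // splice in, bottom level first
        }
    }

    /// Keys in ascending order, read off the bottom level.
    pub fn keys(&self) -> Vec<u64> {
        let mut out = Vec::new();
        let mut cur = self.nodes[0].next[0];
        while cur != NIL {
            out.push(self.nodes[cur].key);
            cur = self.nodes[cur].next[0];
        }
        out
    }
}
```

Insert and lookup share the same predecessors walk, which is the "walk right, then drop down" loop from the trace. Using indices instead of raw pointers keeps the sketch free of unsafe code at the cost of one bounds check per hop.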

Hash table (open addressing, Robin Hood)

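An array of 2^k slots. Hash the key and reduce it modulo the table size to get the home slot. If the home slot holds a different key, probe linearly (slot+1, slot+2, ...). Robin Hood twist: track each entry's probe distance, the number of slots between it and its home. If the entry being inserted is d slots from home when it reaches a slot whose resident is only d' < d slots from its own home, swap them and keep probing with the displaced resident: the "rich" entry (short distance) gives way to the "poor" one (long distance). This does not lower the mean probe distance, but it sharply reduces the variance, so the worst case stays close to the mean.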

home(K1)=2   home(K2)=2   home(K3)=4
  table before insert:
  ┌──┬──┬────┬────┬────┬──┬──┬──┐
  │  │  │ K1 │ K2 │ K3 │  │  │  │
  └──┴──┴────┴────┴────┴──┴──┴──┘
   0  1   2    3    4   5  6  7
  (K1 home=2 dist=0, K2 home=2 dist=1, K3 home=4 dist=0)

  insert K4 with home=2:
   probe slot 2 (K1, dist 0); K4 dist=0; equal — keep going
   probe slot 3 (K2, dist 1); K4 dist=1; equal — keep going
   probe slot 4 (K3, dist 0); K4 dist=2 > 0 — STEAL, K3 displaced
   probe slot 5 (empty); place K3 with dist=1
  result:
  ┌──┬──┬────┬────┬────┬────┬──┬──┐
  │  │  │ K1 │ K2 │ K4 │ K3 │  │  │
  └──┴──┴────┴────┴────┴────┴──┴──┘

Loads up to ~0.9 work well with Robin Hood; we resize at 0.85 (× 2 capacity, rehash all).
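
The insert trace above can be reproduced with the sketch below. It is a simplified illustration, not the lab's table: it uses an identity "hash" (home = key mod capacity) so that keys 2, 10, 4, and 18 land on homes 2, 2, 4, and 2 in an 8-slot table, whereas a real table would hash the key first. The type and method names are invented for the example.

```rust
#[derive(Clone, Copy)]
struct Slot {
    key: u64,
    value: u64,
    dist: u32, // probe distance: slots between this entry and its home
}

pub struct RobinHoodMap {
    slots: Vec<Option<Slot>>, // length is always a power of two
    len: usize,
}

impl RobinHoodMap {
    pub fn new() -> Self {
        RobinHoodMap { slots: vec![None; 8], len: 0 }
    }

    /// Identity "hash" so the trace is reproducible; illustration only.
    fn home(&self, key: u64) -> usize {
        key as usize & (self.slots.len() - 1)
    }

    pub fn insert(&mut self, key: u64, value: u64) {
        if (self.len + 1) * 100 > self.slots.len() * 85 {
            self.grow(); // keep the load factor at or below 0.85
        }
        let mask = self.slots.len() - 1;
        let mut cur = Slot { key, value, dist: 0 };
        let mut i = self.home(key);
        loop {
            match self.slots[i] {
                None => {
                    self.slots[i] = Some(cur);
                    self.len += 1;
                    return;
                }
                Some(res) if res.key == cur.key => {
                    self.slots[i] = Some(Slot { value: cur.value, ..res });
                    return;
                }
                // The resident is "richer" (closer to home): it gives up the slot.
                Some(res) if res.dist < cur.dist => {
                    self.slots[i] = Some(cur);
                    cur = res;
                }
                Some(_) => {}
            }
            i = (i + 1) & mask;
            cur.dist += 1;
        }
    }

    pub fn get(&self, key: u64) -> Option<u64> {
        let mask = self.slots.len() - 1;
        let (mut i, mut dist) = (self.home(key), 0u32);
        loop {
            match self.slots[i] {
                Some(res) if res.key == key => return Some(res.value),
                Some(res) if res.dist >= dist => {} // keep probing
                // Empty slot, or a resident closer to home than we are: absent.
                _ => return None,
            }
            i = (i + 1) & mask;
            dist += 1;
        }
    }

    fn grow(&mut self) {
        let doubled = vec![None; self.slots.len() * 2];
        let old = std::mem::replace(&mut self.slots, doubled);
        self.len = 0;
        for slot in old.into_iter().flatten() {
            self.insert(slot.key, slot.value);
        }
    }

    /// (key, probe distance) per slot, for inspecting the layout.
    pub fn layout(&self) -> Vec<Option<(u64, u32)>> {
        self.slots.iter().map(|s| s.map(|e| (e.key, e.dist))).collect()
    }
}
```

After inserting 2, 10, 4, and 18 in that order, layout() shows 18 in slot 4 with distance 2 and 4 pushed to slot 5 with distance 1, matching the result diagram. The early exit in get is what makes lookups of missing keys cheap.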

When in-memory and on-disk diverge

A skip-list node holding a 16-byte key + 16-byte value + 4 forward pointers is ~64 bytes in memory. The same record packed into a B+-Tree leaf page (4 KiB page, no per-record pointers) is ~36 bytes — no level header, no forward-pointer array. And the B+-Tree co-locates ~100 records in one page, so a range scan of 100 keys is 1 page read instead of 100 random reads.

On disk:

A pointer is an 8-byte offset that triggers a page read (~100 µs cold).
A page-cache miss costs ~50–100× a hit (db-01 Step 3 measured 64 µs cold vs 1.2 µs warm).
A page is 4 KiB whether you read 1 byte or 4096.

Therefore on-disk structures want high fan-out, low height, contiguous siblings, no random pointer chasing. Skip lists violate all four; B+-Trees satisfy all four.

4. Core Terminology

Term	Definition
Skip list	Probabilistic ordered map: stack of linked lists with geometric level distribution
Level	Index of a forward-pointer array in a skip-list node (0 = bottom, dense; higher = sparser)
`p`	Per-level promotion probability (we use 0.5)
Sentinel head	Dummy node with the maximum possible level; all searches start here
Open addressing	Collision resolution: probe other slots in the same array (vs chaining into a list)
Linear probing	Probe sequence is `h, h+1, h+2, …` (best cache behavior)
Robin Hood	On insert, displace any resident whose probe distance is smaller than the newcomer's
Probe distance / PSL	Slots between a key's home slot and its actual slot (probe sequence length)
Load factor	`len / capacity`
Tombstone	Sentinel marking a deleted slot so probes don't short-circuit (we use backward-shift deletion instead)
Backward-shift deletion	After deleting a slot, shift the following non-home entries left by one; avoids tombstones
Cache line	64 bytes on x86_64 (128 bytes on Apple Silicon); the unit the CPU fetches from RAM
Pointer chasing	A traversal whose next address depends on the byte just loaded; CPU cannot prefetch
Fan-out	Number of children per internal node in a tree; B+-Trees aim for hundreds

5. Mental Models

"A skip list is a binary search you can mutate cheaply"

A balanced BST and a skip list have the same asymptotic complexity. The skip list wins because it has no rebalancing: no rotations, no recoloring. Each insert is one geometric coin flip + N forward-pointer writes (N = node height, 2 on average for p = 0.5 and at most the max level). This makes it much easier to make concurrent: LevelDB's MemTable allows lock-free reads while a single writer inserts, because a new node is fully initialized first and stays invisible until an atomic store links it into the bottom-level list.

"A hash table is a sparse array you pretend is dense"

If keys were integers in [0, N) you'd use an array. A hash function fakes that: it maps any key into [0, capacity). Collisions are the cost of the lie. Robin Hood specifically equalizes the cost of the lie across keys, so the worst-case lookup is close to the average.

"Cost models differ by 5 orders of magnitude"

L1 cache hit                        ~1 ns
Main memory                       ~100 ns
NVMe SSD random read (cold)       ~100 µs   ← 1000× RAM
HDD seek                           ~10 ms   ← 100× SSD
Cross-DC round trip                ~50 ms

A "fast" in-memory structure becomes irrelevant if it issues 10 page reads where a B+-Tree issues 1. The B+-Tree's "slow" O(log_B n) with B = 256 beats the skip list's "fast" O(log₂ n) on disk by a factor of log₂(256) = 8 — every level you avoid saves 100 µs.
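
The arithmetic behind that factor of 8 can be checked with a few lines. This is a back-of-the-envelope sketch: the function names are invented, it assumes one cold page read per level, and it reuses the ~100 µs figure from the table above.

```rust
/// Number of levels needed to index `n` keys with fan-out `fanout`:
/// the smallest h such that fanout^h >= n.
pub fn levels(n: u64, fanout: u64) -> u32 {
    assert!(fanout >= 2, "fan-out must be at least 2");
    let mut height = 0;
    let mut reach: u128 = 1; // keys addressable with `height` levels
    while reach < n as u128 {
        reach *= fanout as u128;
        height += 1;
    }
    height
}

/// Cold lookup cost in microseconds, assuming one page read per level.
pub fn cold_lookup_us(n: u64, fanout: u64, page_read_us: u64) -> u64 {
    levels(n, fanout) as u64 * page_read_us
}
```

For 16 M keys (2^24), levels(1 << 24, 256) is 3 and levels(1 << 24, 2) is 24, so at 100 µs per page read the B+-Tree pays 300 µs where a binary structure pays 2,400 µs.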

6. Common Misconceptions

"Skip lists are slow because they're probabilistic." No — the expected and with-high-probability bounds are tight. The variance is small for any list above ~1000 keys. Failure modes are bad RNG seed (we use a deterministic xorshift here) and adversarial key insertion patterns (irrelevant for hashed keys; mitigated by per-instance seed).
"Hash tables have O(1) worst case." Average, not worst. A pathological hash function or adversarial keys produce O(n) chains. Robin Hood mitigates variance but does not change the worst case.
"You should always use a hash table when you don't need ordering." Two cases where skip lists or trees win even unordered: (a) you need iteration in any deterministic order across runs; (b) memory is tight and you can't afford the 15–40% slack of a hash table at safe load factors.
"Open addressing wastes memory because of empty slots." Chaining wastes more in practice: every chain node is a heap allocation with a header (malloc arenas + pointer + next ptr ≈ 32 bytes overhead per entry). Linear-probing hash tables with 70% load factor still use less memory than std::unordered_map.
"A B-Tree is just a balanced BST with more children." No: B+-Trees keep all data in leaves and chain leaves with sibling pointers, making range scans O(1) per page after the initial descent.
"std::unordered_map / Go map / Python dict are the gold standard." They're general-purpose. Specialized hash tables (Abseil's flat_hash_map, Rust's hashbrown, F14) beat them by 2–5× on most workloads. Database authors often roll their own.

7. Interview Talking Points

"Why does LevelDB use a skip list for the MemTable instead of a red-black tree?" → Lockless reads via single-pointer CAS publish; no rotations; easier to implement correctly.
"Why isn't a hash table good enough for a MemTable?" → MemTable flushes to an SSTable, which is a sorted file. A hash table would require sorting at flush time (O(n log n)); a skip list is already sorted, so flush is O(n) sequential.
"When would you use chaining vs open addressing?" → Open addressing for small fixed-size values (better cache); chaining when values are large and you want pointer stability across resizes.
"What's the cost model on disk that breaks skip lists?" → Each level traversal is a potential page read. With log₂ n levels you pay log₂ n × 100 µs. A B+-Tree with fan-out 256 has log_256 n levels, so 3 page reads for 16 M keys vs 24.
"Why is Robin Hood probing useful?" → Bounds variance: the maximum probe distance grows as O(log n) w.h.p., and lookups for missing keys become almost as fast as hits because you can stop as soon as you see a slot with smaller PSL than yours.
"What's the alternative to tombstones?" → Backward-shift deletion: walk forward from the deleted slot, shift each non-home entry left until you hit an empty slot or a home-slotted entry. O(probe-length) per delete, no tombstone bookkeeping.

8. Connections to Other Labs

db-01 Storage Primitives — the page abstraction the disk structures use.
db-03 WAL — the WAL is appended before the MemTable insert; failure recovery rebuilds the MemTable by replaying the WAL.
db-04 Bloom Filters — sits in front of the SSTable; same family of probabilistic in-memory structures.
db-05 LSM MemTable — uses the skip list from this lab, adds a snapshot / immutable flip.
db-10 B-Tree Fundamentals — contrast with this lab; same problem, different cost model.
db-21 Storage Engine Advanced — concurrent skip list (CAS publish), concurrent hash table (extendible hashing, lock striping).

References — db-02 Data Structures for Storage

Skip lists

Pugh, W. (1990). "Skip Lists: A Probabilistic Alternative to Balanced Trees." CACM 33(6). The original paper; six pages, very readable.
https://www.cs.umd.edu/~pugh/galileo/papers/CACM_Skiplist_1990.pdf
LevelDB MemTable source — skip list with a single allocator arena.
https://github.com/google/leveldb/blob/main/db/skiplist.h
RocksDB InlineSkipList — production skip list with per-node tail allocation.
https://github.com/facebook/rocksdb/blob/main/memtable/inlineskiplist.h
Redis t_zset.c — skip list with per-node span field for O(log n) rank queries.
https://github.com/redis/redis/blob/unstable/src/t_zset.c

Hash tables

Celis, P., Larson, P.-Å., Munro, J. I. (1985). "Robin Hood Hashing." FOCS '85. Original paper.
Pedro Celis's thesis on Robin Hood hashing (1986). The probe-distance analysis is here.
https://cs.uwaterloo.ca/research/tr/1986/CS-86-14.pdf
Emmanuel Goossaert's deep dive — accessible and runnable.
https://codecapsule.com/2013/11/11/robin-hood-hashing/
Google Abseil flat_hash_map — SIMD probing on top of open addressing.
https://abseil.io/about/design/swisstables
Rust hashbrown — port of Abseil's SwissTable; the implementation behind std::collections::HashMap.
https://github.com/rust-lang/hashbrown

Cost models & cache

Drepper, U. (2007). "What Every Programmer Should Know About Memory."
https://akkadia.org/drepper/cpumemory.pdf
Jeff Dean's "Numbers Every Programmer Should Know" — the canonical latency hierarchy.
https://gist.github.com/jboner/2841832
Pavlo, A. "Database Storage I" (CMU 15-445). The disk-vs-RAM cost-model lecture.
https://15445.courses.cs.cmu.edu/fall2023/slides/03-storage1.pdf

Tree structures (preview for db-10)

Bayer, R., McCreight, E. (1972). "Organization and Maintenance of Large Ordered Indices." Acta Informatica. The original B-Tree paper.
Comer, D. (1979). "The Ubiquitous B-Tree." ACM Computing Surveys. The canonical survey.

Background

Sedgewick & Wayne, Algorithms 4th ed.
Cormen, Leiserson, Rivest, Stein, Introduction to Algorithms (CLRS) 3rd ed. — Ch. 11 (hash tables), Ch. 17 (amortized analysis).

Analysis — Design Decisions

Every choice here is reversible in code but irreversible in performance: change one, all the others bend with it.

D1. Skip list over balanced BST (red-black, AVL)

	Skip list	Red-black tree
Insert / lookup	`O(log n)` expected	`O(log n)` worst case
Implementation LOC	~120 (Rust)	~400+
Concurrent reads	Lock-free with seqlock or single-CAS publish	Requires hand-over-hand locking
Cache locality	Poor (pointer chasing)	Poor (pointer chasing)
Worst-case bound	w.h.p., not absolute	Absolute

Choice: skip list. The simplicity (no rotations, no rebalancing, no parent pointers) is the value proposition. The lack of an absolute worst-case bound is irrelevant here: node heights come from our own RNG, not from the keys, so no key sequence can force a degenerate shape.

When you'd flip: real-time systems where a probabilistic O(n) blowup is unacceptable. Database MemTables don't qualify.

D2. Hash table: open addressing + Robin Hood, not chaining

	Open addressing (linear)	Chaining
Memory per entry	(1/loadfactor) × sizeof(slot)	sizeof(entry) + sizeof(ptr) + malloc overhead
Cache misses per lookup	0–1 typically	1–3 typically
Pointer stability across resize	No	Yes
Deletion	Backward-shift or tombstone	Free a list node
Tail latency at high load	Spikes near 1.0	Degrades gracefully

Choice: open addressing, linear probing, Robin Hood. The cache-miss saving is decisive at small/medium values, which is the database use case.

When you'd flip: large values (≥1 KiB) where you want pointer stability so external references survive resize.

D3. `p = 0.5` for skip list, not `p = 0.25`

The expected number of comparisons per search is (1/p) · log_{1/p}(n). Minimizing this gives p = 1/e ≈ 0.37. In practice:

`p`	Expected comparisons (n=1M)	Levels	Memory per node
0.25	~40	~10	1.33 forward ptrs avg
0.5	~40	~20	2 forward ptrs avg
1/e	~38	~14	1.58 forward ptrs avg

We pick p = 0.5 because (a) bit-shift sampling (count trailing zeros of a random u64) is one instruction, and (b) the code stays trivial. The ~6% theoretical improvement from p = 1/e is not worth the table-lookup or log math. p = 0.25 matches p = 0.5 on comparisons while using a third fewer pointers, so it is the natural next step if memory matters.
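
The trailing-zeros trick can be sketched as follows. The function names and the xorshift64 generator are choices made for this example; the lab's own RNG wiring may differ. With p = 0.5 the expected height is 1 / (1 - p) = 2, which mean_height lets you check empirically.

```rust
pub const MAX_LEVEL: u32 = 32;

/// Node height for p = 0.5 from one random word: each trailing zero bit
/// counts as one successful promotion. Capped at MAX_LEVEL.
pub fn height_from_bits(bits: u64) -> u32 {
    (bits.trailing_zeros() + 1).min(MAX_LEVEL)
}

/// One xorshift64 step; `state` must be non-zero.
pub fn next_random(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

/// Average height over `draws` samples drawn from the given seed.
pub fn mean_height(seed: u64, draws: u32) -> f64 {
    let mut state = seed | 1; // keep the generator state non-zero
    let total: u64 = (0..draws)
        .map(|_| height_from_bits(next_random(&mut state)) as u64)
        .sum();
    total as f64 / draws as f64
}
```

A word ending in binary ...1000 has three trailing zeros and therefore yields height 4; an all-zero word hits the MAX_LEVEL cap.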

D4. Max level = 32

A skip list with p=0.5 and n entries has expected max level log₂ n. At max level 32 we support n = 2^32 ≈ 4 G entries before quality degrades. Going higher costs a forward-pointer slot per node forever (8 bytes per extra level) when nodes embed a fixed-size pointer array, and one extra head slot otherwise. Going lower caps the size at which searches stay logarithmic.

D5. Hash function: FNV-1a 64-bit

We need a hash that is:

High quality enough for non-adversarial keys (passes basic distribution tests)
Fast for small keys (~10 cycles per 8 bytes)
Identical in all three languages so cross-language tests can compare counts

FNV-1a is 6 lines of code and deterministic, but on its own it mixes poorly on short, sequential keys (step 02 measures the damage), so we pass its output through a SplitMix64 finalizer. We use it because we control the input keys in this lab; in production you'd switch to xxHash3 or SipHash.

When you'd flip: keys controlled by adversaries → SipHash with a per-instance random key.

D6. Load factor 0.85, grow ×2

Robin Hood handles load factor up to ~0.9 well; beyond that the variance of probe distance blows up. We resize at 0.85 to leave headroom and double the capacity (and rehash). Cost of resize is amortized O(1) per insert.

D7. Backward-shift deletion, no tombstones

Tombstones simplify code but bloat the table over time — a "delete-then-insert" workload fills the table with markers and forces a resize. Backward-shift deletion costs O(PSL) per delete (typically <5 slots) and keeps the table compact. The implementation walks forward from the deleted slot, moves each entry one slot left until reaching an empty slot or an entry whose PSL is 0.

D8. Seed: deterministic per-instance, random across instances

The skip list RNG must not be deterministically the same across runs. But within one process run we want reproducible behavior for debugging. The CLI accepts an optional seed; tests pass a fixed value.

Cost-model cheat sheet

Operation	Latency
L1 cache hit	~1 ns
L2 cache hit	~4 ns
L3 cache hit	~15 ns
Main memory	~100 ns
4 KiB page from SSD (cold)	~100 µs
4 KiB page from HDD	~10 ms

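Every random pointer dereference is a potential cache miss all the way to main memory → ~100 ns. A skip list with 20 levels follows at least 20 such pointers on a cold lookup = ~2 µs. A linear-probing hash table touches 1–2 cache lines, so a cold lookup costs ~100–200 ns and a cached one ~10 ns. That puts the hash table roughly 10–20× ahead for point lookups once the working set leaves L3, and further ahead whenever its slots stay cached while the skip list's nodes do not.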

What breaks at scale

Symptom	Cause	Mitigation
Skip-list lookup tail latency 10× worse than median	Bad RNG sequence; node height variance	Use a higher-quality PRNG; bound max height
Hash table tail latency spikes near 90% load	Robin Hood variance explodes	Resize earlier (load factor 0.75)
Skip-list memory 2× of equivalent BST	Forward-pointer array overhead	Use per-level arena allocators; pack pointer arrays
Hash table grows but never shrinks	Resize is unidirectional in our impl	Shrink when load < 0.25 (we don't — extension point)
Iterator skips entries	Mutation during iteration	Snapshot at iterator-creation time (db-05 covers this)

Execution — How to Build and Run

Quick start (per language)

# Rust
cd src/rust
cargo build --release
cargo test --release
./target/release/dsbench --help

# Go
cd src/go
go test ./...
go build -o bin/dsbench ./cmd/dsbench
./bin/dsbench --help

# C++
cd src/cpp
cmake -S . -B build && cmake --build build
ctest --test-dir build
./build/dsbench --help

CLI: `dsbench`

A single binary per language that exercises both data structures.

Subcommand	Args	What it does
`skiplist insert N [seed]`	N (count)	Inserts N keys, prints final size + max level + histogram
`skiplist roundtrip N`	N	Inserts N keys, verifies every key reads back, then removes them
`skiplist iter N`	N	Inserts N random keys, prints all keys in iterator order
`hashtable insert N`	N	Inserts N keys, reports load factor + max probe distance + histogram
`hashtable roundtrip N`	N	Insert + verify + delete + verify gone
`bench point N`	N	Inserts N keys into both, benchmarks point lookups for each
`bench mem N`	N	Reports bytes-per-entry for both structures

Library API

Same shape in all three languages.

SkipList::new(seed)            -> SkipList
SkipList::insert(key, value)   -> bool     (true if newly inserted, false if replaced)
SkipList::get(key)             -> Option<value>
SkipList::remove(key)          -> bool
SkipList::len()                -> usize
SkipList::iter()               -> sorted iterator over (key, value)

HashTable::new(capacity)       -> HashTable
HashTable::insert(key, value)  -> bool
HashTable::get(key)            -> Option<value>
HashTable::remove(key)         -> bool
HashTable::len()               -> usize
HashTable::load_factor()       -> f64
HashTable::max_probe()         -> usize

Keys and values are byte strings.

Verifying

./scripts/verify.sh        # invariants per structure
./scripts/cross_test.sh    # cross-language behavioral checks

Observation — Looking Inside the Structures

What you should see

This lab is at its best when you stop trusting the numbers and start looking at the memory layout.

1. Histogram of skip-list node heights

The skip list with p=0.5 should produce a geometric distribution: ~half the nodes at level 0 only, ~quarter reaching level 1, etc.

./target/release/dsbench skiplist insert 100000

Sample output:

level   count    %
    0  50032   50.0   ████████████████████████████████████████████████
    1  25021   25.0   █████████████████████████
    2  12508   12.5   █████████████
    3   6234    6.2   ██████
    4   3098    3.1   ███
    5   1581    1.6   ██
    6    778    0.8   █
    ...
   max level used = 16

If your distribution is skewed (e.g., level 0 is 25% instead of 50%) your RNG or sampling code is wrong. The most common bug is rng() & 1 evaluated once and reused.
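A standalone sampler makes the expected shape easy to check without the rest of the skip list. This is a sketch under assumptions: it reuses the xorshift64 + trailing-zeros scheme from step 01, reports 0-based top levels to match the histogram above, and the names `next_u64`, `random_top_level`, and `level_histogram` are illustrative.

```rust
const MAX_LEVEL: usize = 32;

/// One xorshift64 step. The state must be non-zero (0 is a fixed point).
fn next_u64(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

/// 0-based top level of a new node: one level per trailing zero bit, capped.
/// A fresh random word is drawn on every call; reusing one word is the
/// "evaluated once and reused" bug described above.
fn random_top_level(state: &mut u64) -> usize {
    (next_u64(state).trailing_zeros() as usize).min(MAX_LEVEL - 1)
}

/// Histogram of top levels for `n` sampled nodes.
fn level_histogram(n: usize, seed: u64) -> [usize; MAX_LEVEL] {
    let mut state = seed.max(1); // never start xorshift at 0
    let mut hist = [0usize; MAX_LEVEL];
    for _ in 0..n {
        hist[random_top_level(&mut state)] += 1;
    }
    hist
}
```

With a healthy sampler, `hist[0]` should hold about half of the samples and each following bucket about half of the previous one.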

2. Hash-table probe-distance histogram

./target/release/dsbench hashtable insert 1000000

Sample output (Robin Hood, load 0.477):

probe distance   count       %
            0   633412     63.3
            1   235108     23.5
            2    87412      8.7
            3    32104      3.2
            4    10001      1.0
            5     1652      0.2
            6      298      0.0
            7       12      0.0
            8        1      0.0
   mean = 0.55   max = 8   capacity = 2097152   load = 0.477

With pure linear probing (no Robin Hood) the tail extends much further.

3. Memory accounting

./target/release/dsbench bench mem 1000000

Sample:

skip list  : 1,000,000 entries, ~80 B/entry
hash table : 1,000,000 entries, ~25 B/entry  (cap=2097152, load=0.477)

What "working" looks like

Skip list: max level grows like log₂ n + 5 (slight overshoot from variance).
Hash table: mean probe < 1, max probe < 10 at load ≤ 0.85.
Bench: hash table 5–20× faster than skip list on point ops.
Memory: hash table is 2–4× smaller per entry.

What "broken" looks like

Mean probe distance climbs above 2 → poor hash function or table not actually power-of-two-sized.
Max skip-list level stuck at 1–2 with 1M entries → RNG broken; bit-test always falls through.
Same level distribution from one run to the next → seed not random.
Hash table size doesn't grow after 85% load → resize trigger not firing.
max_probe 3–4× above the theoretical bound → almost always the hash function. We hit this with raw FNV-1a 64-bit (max_probe ≈ 200 at N=100k vs expected ≤ 66); adding a SplitMix64 finalizer fixed it. Our raw-FNV variant also had a typo in its prime constant: see step 02 for the canonical value 0x00000100000001b3 and the three pinned vectors ("", "a", "foobar").

What `scripts/cross_test.sh` proves

Runs dsbench skiplist iter N seed in Go, Rust, and C++ and diffs all three outputs. If they aren't byte-identical, one of the three has drifted on hash function, RNG, or ordering — usually the easiest single signal for catching a port regression.

Verification — Invariants

scripts/verify.sh runs the language-default binary (Rust by default) through these checks. scripts/cross_test.sh then re-runs the same scenarios in Go and C++ and asserts the behaviorally observable outputs match. The internal layouts are not required to match — only the API behavior.

#	Invariant	How verified
V1	Skip list round-trip: `insert(k, v)` then `get(k) == v` for all k	`dsbench skiplist roundtrip 10000`
V2	Skip list iteration is in sorted order	`dsbench skiplist iter 1000` piped to `sort -c`
V3	Skip list level distribution is geometric (`p=0.5 ± tolerance`)	histogram chi-square check in unit test
V4	Skip list max level stays ≤ MAX_LEVEL even with 100k inserts	bench reports `max_level_used`
V5	Hash table round-trip: `insert(k, v)` then `get(k) == v` for all k	`dsbench hashtable roundtrip 10000`
V6	Hash table max probe distance ≤ `4 × log₂(cap × load)` at load ≤ 0.85	unit test asserts
V7	Hash table resizes at load 0.85, capacity doubles, max-probe drops	unit test
V8	Backward-shift deletion never leaves a lookup hole	unit test: insert 100, delete random 50, assert remaining 50 still found
V9	Insert with same key twice replaces value, `len()` unchanged	unit test
V10	Cross-language: insert sequence `[5, 1, 3, 8, 2]` into skip list, iter output is sorted in all 3 langs	`cross_test.sh`
V11	Cross-language: hash table after inserting same 1000 keys reports same `len()`	`cross_test.sh`
V12	Cross-language: roundtrip of the same N keys works in all 3 langs	`cross_test.sh`
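The V6 bound is plain arithmetic, so it can live in a helper shared by the unit tests. A minimal sketch; `max_probe_bound` and `within_v6` are hypothetical names, not part of the documented API:

```rust
/// V6 bound: 4 * log2(capacity * load). Note capacity * load is the entry count.
fn max_probe_bound(capacity: usize, load: f64) -> f64 {
    4.0 * (capacity as f64 * load).log2()
}

/// True if an observed max probe distance satisfies V6.
fn within_v6(max_probe: usize, capacity: usize, load: f64) -> bool {
    (max_probe as f64) <= max_probe_bound(capacity, load)
}
```

For N = 100k entries the bound works out to about 66, which is the figure quoted in the "broken" checklist for the raw FNV-1a failure.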

Running

./scripts/verify.sh          # ~5s, runs Rust binary
DSE_BIN=./src/go/bin/dsbench ./scripts/verify.sh
DSE_BIN=./src/cpp/build/dsbench ./scripts/verify.sh

./scripts/cross_test.sh      # builds all 3, runs cross-language checks

What to do when a check fails

Failure	Most likely cause
V1, V5	Off-by-one in insert path; key not normalized
V2	Skip list `level 0` chain has out-of-order pointer write
V3	RNG broken: same bit pattern reused per call
V4	Level cap not enforced
V6	Hash table not actually power-of-two; `home_slot = hash % cap` not masking
V7	Load-factor check uses `>` instead of `>=`, or `len` not decremented on delete
V8	Tombstones left in array; backward-shift loop terminates too early
V11	Hash function differs across langs; all three must use the same FNV-1a + SplitMix64 finalizer

Broader Ideas — Where to Go Next

Extensions you can build on top of this lab. Each is a 0.5–2 day exercise.

1. Concurrent skip list with lock-free reads

LevelDB's MemTable allows concurrent readers while a single writer inserts. The trick: nodes are made visible by a single atomic store of the bottom-level forward pointer — once that store lands, the node exists; before, it doesn't. Higher-level pointers can race because they're only used to speed up a search; if they point to a not-yet-visible node, the next compare won't match and the search retries.

Implement: an AtomicPtr per forward pointer, a single writer (enforced by external mutex or compare_exchange), no per-node lock. Test: spawn 8 readers + 1 writer, run for 10s, assert no reader observes a partial node.
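The publish step can be tried in isolation on the level-0 chain before adding towers. The sketch below is a simplified stand-in, not LevelDB's code: one sorted linked list, a single writer (the caller must guarantee that), unique keys assumed, and nodes freed only when the whole list is dropped. `List` and `Node` are illustrative names.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct Node {
    key: u64,
    value: u64,
    next: AtomicPtr<Node>,
}

/// Level-0 chain only. Readers never take a lock.
struct List {
    head: AtomicPtr<Node>,
}

impl List {
    fn new() -> Self {
        List { head: AtomicPtr::new(ptr::null_mut()) }
    }

    /// Single writer only (enforce with an external mutex).
    fn insert(&self, key: u64, value: u64) {
        let mut link: &AtomicPtr<Node> = &self.head;
        loop {
            let cur = link.load(Ordering::Acquire);
            if cur.is_null() || unsafe { (*cur).key } >= key {
                // Fully initialize the node first, including its `next`...
                let node = Box::into_raw(Box::new(Node {
                    key,
                    value,
                    next: AtomicPtr::new(cur),
                }));
                // ...then publish it with one Release store. Before this store
                // the node does not exist for readers; after it, it is complete.
                link.store(node, Ordering::Release);
                return;
            }
            link = unsafe { &(*cur).next };
        }
    }

    /// Lock-free read: Acquire loads pair with the writer's Release store.
    fn get(&self, key: u64) -> Option<u64> {
        let mut cur = self.head.load(Ordering::Acquire);
        while !cur.is_null() {
            let n = unsafe { &*cur };
            if n.key == key {
                return Some(n.value);
            }
            if n.key > key {
                return None;
            }
            cur = n.next.load(Ordering::Acquire);
        }
        None
    }
}

impl Drop for List {
    fn drop(&mut self) {
        // No readers can exist here (&mut self), so freeing is safe.
        let mut cur = *self.head.get_mut();
        while !cur.is_null() {
            let mut boxed = unsafe { Box::from_raw(cur) };
            cur = *boxed.next.get_mut();
        }
    }
}
```

Because nodes are never unlinked while readers run, there is no reclamation problem to solve; adding `remove` is where epochs or hazard pointers would come in.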

2. Concurrent hash table: lock striping + extendible hashing

Lock striping: 64 stripe mutexes; key's stripe = hash & 63. A write locks its stripe; reads either lock-read or use seqlock counters.

Extendible hashing: instead of full-table resize, split one bucket at a time when it grows past a threshold.
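Lock striping is small enough to sketch with the standard library. Assumptions to note: each stripe here guards its own `HashMap` rather than a region of one Robin Hood array, and the stripe is chosen with std's `DefaultHasher` instead of this lab's hash function. `StripedMap` is a hypothetical name.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

const STRIPES: usize = 64; // power of two, so `hash & (STRIPES - 1)` picks a stripe

struct StripedMap {
    stripes: Vec<Mutex<HashMap<Vec<u8>, Vec<u8>>>>,
}

impl StripedMap {
    fn new() -> Self {
        StripedMap {
            stripes: (0..STRIPES).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    /// key's stripe = hash & 63
    fn stripe_of(key: &[u8]) -> usize {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() as usize) & (STRIPES - 1)
    }

    /// Locks only the key's stripe. Returns true if the key was new.
    fn insert(&self, key: &[u8], value: &[u8]) -> bool {
        let mut stripe = self.stripes[Self::stripe_of(key)].lock().unwrap();
        stripe.insert(key.to_vec(), value.to_vec()).is_none()
    }

    /// Reads lock the same stripe; a seqlock would avoid this for hot keys.
    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        let stripe = self.stripes[Self::stripe_of(key)].lock().unwrap();
        stripe.get(key).cloned()
    }
}
```

Writers to different stripes never contend, which is the whole point; a resize that spans stripes is the hard part and is what extendible hashing sidesteps.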

3. ART (Adaptive Radix Tree)

A radix tree variant that uses 4 different internal node layouts (4, 16, 48, 256 children) and adapts based on density. Wins for variable-length keys with shared prefixes (URLs, paths). [Leis et al., ICDE 2013].

4. Cuckoo hashing

Two hash functions, two candidate slots per key. Lookups take at most 2 probes; the cost moves to inserts, which may have to relocate existing entries. Used in Memcached extensions.

5. Hopscotch hashing

Each entry must live within H slots of its home (typically H=32). Bounded probe distance like Robin Hood with stronger guarantee.

6. B+-Tree (preview db-10)

Write an in-memory B+-Tree with fan-out 64, leaves chained, and compare to skip list on range scans. This sets up the "why on-disk B-Trees beat skip lists" intuition before you've touched disk.

7. Skip list with rank queries (Redis `ZRANK`)

Add a span field per forward pointer = "number of bottom-level nodes this pointer skips over." Now rank(key) is O(log n) instead of O(n). ~50 LOC of additions.

8. Bloom filter (preview db-04)

In front of the hash table: a 1-bit-per-position array sized for ~1% false-positive rate at expected N. Measure: at 50%-miss workloads the Bloom filter saves you cache misses; at 95%-hit workloads it's pure overhead.
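Sizing for a target false-positive rate follows the standard formulas: bits per key = -ln(fpr) / (ln 2)², hash count = bits per key × ln 2. The sketch below applies them to a small filter. It is an illustration, not the db-04 design: it derives all probe positions from two hashes (double hashing) and reuses the FNV-1a + SplitMix64 mix from step 02; `Bloom`, `hash64`, and `remix` are hypothetical names.

```rust
use std::f64::consts::LN_2;

/// Bits per key needed for a target false-positive rate.
fn bits_per_key(fpr: f64) -> f64 {
    -fpr.ln() / (LN_2 * LN_2)
}

/// Optimal number of hash probes for that many bits per key.
fn num_hashes(bits_per_key: f64) -> u32 {
    (bits_per_key * LN_2).round().max(1.0) as u32
}

/// SplitMix64 finalizer.
fn remix(mut h: u64) -> u64 {
    h ^= h >> 30;
    h = h.wrapping_mul(0xbf58476d1ce4e5b9);
    h ^= h >> 27;
    h = h.wrapping_mul(0x94d049bb133111eb);
    h ^ (h >> 31)
}

/// FNV-1a 64-bit followed by the finalizer.
fn hash64(key: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in key {
        h = (h ^ b as u64).wrapping_mul(0x100000001b3);
    }
    remix(h)
}

struct Bloom {
    bits: Vec<u64>,
    nbits: u64,
    k: u32,
}

impl Bloom {
    fn new(expected_n: usize, fpr: f64) -> Self {
        let bpk = bits_per_key(fpr);
        let nbits = ((expected_n as f64 * bpk).ceil() as u64).max(64);
        Bloom { bits: vec![0; ((nbits + 63) / 64) as usize], nbits, k: num_hashes(bpk) }
    }

    /// Double hashing: position_i = (h1 + i * h2) mod nbits.
    fn bit_index(&self, h1: u64, h2: u64, i: u32) -> u64 {
        h1.wrapping_add((i as u64).wrapping_mul(h2)) % self.nbits
    }

    fn insert(&mut self, key: &[u8]) {
        let h1 = hash64(key);
        let h2 = remix(h1) | 1; // odd step so probes do not collapse
        for i in 0..self.k {
            let p = self.bit_index(h1, h2, i);
            self.bits[(p / 64) as usize] |= 1u64 << (p % 64);
        }
    }

    /// false means definitely absent; true means "probably present".
    fn may_contain(&self, key: &[u8]) -> bool {
        let h1 = hash64(key);
        let h2 = remix(h1) | 1;
        (0..self.k).all(|i| {
            let p = self.bit_index(h1, h2, i);
            self.bits[(p / 64) as usize] & (1u64 << (p % 64)) != 0
        })
    }
}
```

At a 1% target this comes to a little under 10 bits per key and 7 probes, so every lookup that hits pays 7 extra bit tests before touching the table, which is why hit-heavy workloads see it as overhead.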

9. xor / cuckoo / ribbon filters

Modern variants (xor [Graf & Lemire 2020], ribbon [Dillinger 2021]) get the same false-positive rate as Bloom with 25–35% less memory.

10. Cache-conscious skip list

Replace the per-node forward-pointer array with a contiguous tail allocation (RocksDB's InlineSkipList). Compare cache miss rates: same algorithm, half the misses.

11. Persistent / immutable variants

Build an immutable skip list where insert returns a new root, sharing 99% of nodes with the old one. Useful for snapshots, MVCC.

When you've explored a couple, you're ready for db-03 Write-Ahead Log, where the durability story begins.

Step 1 — Implement the Skip List

Goal

Build a sorted map with O(log n) expected insert/lookup/remove and O(n) ordered iteration.

API

SkipList::new(seed: u64) -> SkipList
SkipList::insert(key, value) -> bool   // false if replaced existing
SkipList::get(key) -> Option<value>
SkipList::remove(key) -> bool
SkipList::len() -> usize
SkipList::iter() -> iterator<(key, value)>
SkipList::max_level_used() -> usize
SkipList::level_histogram() -> [usize; MAX_LEVEL]

Constants

MAX_LEVEL = 32
P = 0.5 (sample via min(count_trailing_zeros(rng()) + 1, MAX_LEVEL))

Data layout

SkipList {
    head:    Node*       // dummy sentinel at MAX_LEVEL
    level:   usize       // current max level used (1..=MAX_LEVEL)
    len:     usize
    rng:     u64         // xorshift64 state
}

Node {
    key:      Vec<u8>
    value:    Vec<u8>
    forward:  Vec<Option<Box<Node>>>   // length = height of this node (conceptual: only forward[0] owns its target, see below)
}

In Rust we use Box<Node> for ownership and raw pointers for siblings (or Option<NonNull<Node>> for safer raw pointers). In Go, *Node is the natural choice. In C++, std::unique_ptr<Node> for the sole owner of level-0 next, raw pointers for higher levels.

For simplicity we use a single ownership style: level-0 owns nodes; higher levels hold raw pointers. Drop walks the level-0 chain once.

Insert pseudocode

update[0..MAX_LEVEL] = head            // predecessor at each level, defaults to head
x = head
for i in (level-1) down to 0:          // top level first
    while x.forward[i] != null && x.forward[i].key < key:
        x = x.forward[i]
    update[i] = x

if x.forward[0] != null && x.forward[0].key == key:
    x.forward[0].value = value
    return false                       // replaced

new_level = random_level()
if new_level > level:
    for i in level..new_level: update[i] = head   // new levels hang off head
    level = new_level

node = new Node(key, value, new_level)
for i in 0..new_level:
    node.forward[i] = update[i].forward[i]
    update[i].forward[i] = node
len += 1
return true

Random level

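const MAX_LEVEL: usize = 32;

/// Returns a node height in 1..=MAX_LEVEL with P(height > k) = 2^-k.
/// `rng` is xorshift64 state and must be seeded non-zero (0 is a fixed point).
fn random_level(rng: &mut u64) -> usize {
    *rng ^= *rng << 13;
    *rng ^= *rng >> 7;
    *rng ^= *rng << 17;
    // Each trailing zero bit is one more "heads" in the coin-flip sequence.
    let lvl = (rng.trailing_zeros() as usize) + 1;
    lvl.min(MAX_LEVEL)
}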
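The insert pseudocode and the level sampler fit together as follows. This is a sketch under one loud assumption: it stores nodes in a `Vec` arena and links them by index instead of the Box/raw-pointer layout described above, which keeps the example in safe Rust. `remove` is omitted, and `predecessors`, `keys`, and `NIL` are illustrative names.

```rust
const MAX_LEVEL: usize = 32;
const NIL: usize = usize::MAX; // "null" index into the arena

struct Node {
    key: Vec<u8>,
    value: Vec<u8>,
    forward: Vec<usize>, // length = height; entries are arena indices or NIL
}

struct SkipList {
    nodes: Vec<Node>, // arena; index 0 is the head sentinel
    level: usize,     // current height in use (1..=MAX_LEVEL)
    len: usize,
    rng: u64,         // xorshift64 state, never zero
}

impl SkipList {
    fn new(seed: u64) -> Self {
        let head = Node { key: Vec::new(), value: Vec::new(), forward: vec![NIL; MAX_LEVEL] };
        SkipList { nodes: vec![head], level: 1, len: 0, rng: seed.max(1) }
    }

    fn random_level(&mut self) -> usize {
        self.rng ^= self.rng << 13;
        self.rng ^= self.rng >> 7;
        self.rng ^= self.rng << 17;
        (self.rng.trailing_zeros() as usize + 1).min(MAX_LEVEL)
    }

    /// For each level, the last node whose key is < `key` (0 = head).
    fn predecessors(&self, key: &[u8]) -> [usize; MAX_LEVEL] {
        let mut update = [0usize; MAX_LEVEL];
        let mut x = 0;
        for i in (0..self.level).rev() {
            loop {
                let next = self.nodes[x].forward[i];
                if next == NIL || self.nodes[next].key.as_slice() >= key {
                    break;
                }
                x = next;
            }
            update[i] = x;
        }
        update
    }

    /// Returns true if newly inserted, false if an existing value was replaced.
    fn insert(&mut self, key: &[u8], value: &[u8]) -> bool {
        let update = self.predecessors(key);
        let next = self.nodes[update[0]].forward[0];
        if next != NIL && self.nodes[next].key == key {
            self.nodes[next].value = value.to_vec();
            return false;
        }
        let height = self.random_level();
        self.level = self.level.max(height);
        let id = self.nodes.len();
        // Splice: the new node inherits each predecessor's old successor.
        let forward: Vec<usize> =
            (0..height).map(|i| self.nodes[update[i]].forward[i]).collect();
        self.nodes.push(Node { key: key.to_vec(), value: value.to_vec(), forward });
        for i in 0..height {
            self.nodes[update[i]].forward[i] = id;
        }
        self.len += 1;
        true
    }

    fn get(&self, key: &[u8]) -> Option<&[u8]> {
        let next = self.nodes[self.predecessors(key)[0]].forward[0];
        if next != NIL && self.nodes[next].key == key {
            Some(self.nodes[next].value.as_slice())
        } else {
            None
        }
    }

    /// Keys in sorted order: a plain walk along level 0.
    fn keys(&self) -> Vec<Vec<u8>> {
        let mut out = Vec::with_capacity(self.len);
        let mut x = self.nodes[0].forward[0];
        while x != NIL {
            out.push(self.nodes[x].key.clone());
            x = self.nodes[x].forward[0];
        }
        out
    }
}
```

The arena trades pointer chasing for index chasing, so the cache behavior is similar; what it buys is that ownership is trivially correct, which makes it a useful reference when debugging the pointer-based version.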

Tests

#	Test	Pass if
T1	insert 1000 random keys, all `get` succeed	every value matches
T2	insert sorted keys 0..999, iter yields 0..999	strictly increasing
T3	insert + remove all keys, len = 0	empty
T4	insert with same key twice, len unchanged	replacement worked
T5	level distribution at N=100k is geometric	sum of L≥k slots ≈ N · 2^-k

Step 2 — Implement the Hash Table

Goal

Open-addressing hash table with linear probing + Robin Hood + backward-shift deletion.

API

HashTable::new(initial_capacity_pow2: usize) -> HashTable
HashTable::insert(key, value) -> bool          // false if replaced
HashTable::get(key) -> Option<value>
HashTable::remove(key) -> bool
HashTable::len() -> usize
HashTable::capacity() -> usize
HashTable::load_factor() -> f64
HashTable::max_probe() -> usize
HashTable::probe_histogram() -> Vec<usize>

Hash function

FNV-1a 64-bit followed by a SplitMix64 finalizer (identical in all three languages):

offset = 0xcbf29ce484222325
prime  = 0x00000100000001b3        # = 1_099_511_628_211
h = offset
for byte in key:
    h ^= byte
    h = h * prime  (wrapping)
return splitmix64_finalize(h)

fn splitmix64_finalize(mut h: u64) -> u64 {
    h ^= h >> 30; h = h.wrapping_mul(0xbf58476d1ce4e5b9);
    h ^= h >> 27; h = h.wrapping_mul(0x94d049bb133111eb);
    h ^= h >> 31;
    h
}

Plain FNV-1a has notoriously poor avalanche on short, sequential keys — running it raw against Robin Hood probing blows up max_probe (we measured 206 vs the expected ≲66 bound at N=100k). The SplitMix64 finalizer is bijective (adds no collisions) and re-mixes the high bits down, which restores the geometric PSL distribution.
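Put together in Rust, the hash is two short functions. The sketch keeps raw FNV-1a as its own function so it can be checked against the published FNV-1a 64-bit test values for "", "a", and "foobar", which is the cheapest way to catch a mistyped prime. The function names are illustrative.

```rust
const FNV_OFFSET: u64 = 0xcbf29ce484222325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Raw FNV-1a 64-bit: xor the byte in, then multiply by the prime.
fn fnv1a64(key: &[u8]) -> u64 {
    let mut h = FNV_OFFSET;
    for &b in key {
        h ^= b as u64;
        h = h.wrapping_mul(FNV_PRIME);
    }
    h
}

/// SplitMix64 finalizer: bijective, so it adds no collisions.
fn splitmix64_finalize(mut h: u64) -> u64 {
    h ^= h >> 30;
    h = h.wrapping_mul(0xbf58476d1ce4e5b9);
    h ^= h >> 27;
    h = h.wrapping_mul(0x94d049bb133111eb);
    h ^= h >> 31;
    h
}

/// The hash used for slot selection: FNV-1a, then the finalizer.
fn hash(key: &[u8]) -> u64 {
    splitmix64_finalize(fnv1a64(key))
}
```

Testing `fnv1a64` and `hash` separately tells you which half is wrong when a cross-language vector fails.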

Known-answer vectors (pin these in tests across all three languages):

key	hash
`""`	`0xf52a15e9a9b5e89b`
`"a"`	`0x02c0bdbf481420f8`
`"foobar"`	`0x404da9e3b74078c2`

Slot layout

Slot {
    occupied: bool        // or use psl = MAX as sentinel
    psl:      u32         // probe sequence length
    hash:     u64
    key:      Vec<u8>
    value:    Vec<u8>
}

We store the full 64-bit hash inside each slot so resizes don't re-hash keys, and so that get can compare hashes (cheap) before keys (expensive).

Insert (Robin Hood)

if (len + 1) > 0.85 * capacity: resize(capacity * 2)   // compare in floating point; integer division would always give 0

h = hash(key)
i = h & (capacity - 1)
psl = 0
loop:
    if !slots[i].occupied:
        slots[i] = (key, value, h, psl); occupied; len += 1
        return true
    if slots[i].hash == h && slots[i].key == key:
        slots[i].value = value
        return false                 // replaced
    if slots[i].psl < psl:
        swap(slots[i], (key, value, h, psl))   // steal
    i = (i + 1) & (capacity - 1)
    psl += 1

Get

h = hash(key)
i = h & (capacity - 1)
psl = 0
loop:
    if !slots[i].occupied: return None
    if slots[i].psl < psl: return None      // would have stolen
    if slots[i].hash == h && slots[i].key == key:
        return Some(slots[i].value)
    i = (i + 1) & (capacity - 1)
    psl += 1

Remove (backward-shift)

i = find_slot(key)
if not found: return false
loop:
    j = (i + 1) & (capacity - 1)
    if !slots[j].occupied || slots[j].psl == 0:
        slots[i].occupied = false
        len -= 1
        return true
    slots[i] = slots[j]
    slots[i].psl -= 1
    i = j

Resize

new_slots = [empty; capacity * 2]
old = swap(slots, new_slots); capacity *= 2; len = 0
for slot in old where occupied:
    insert(slot.key, slot.value)   // re-uses hash if you cache it
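The four routines above fit in one small Rust type. Treat this as a sketch rather than the lab's reference solution: it models a slot as `Option<Slot>` instead of an `occupied` flag, inlines the hash from this step, and adds two helper names of its own, `place` (the probe loop shared by insert and resize) and `find`.

```rust
#[derive(Clone)]
struct Slot {
    psl: u32,
    hash: u64,
    key: Vec<u8>,
    value: Vec<u8>,
}

struct HashTable {
    slots: Vec<Option<Slot>>, // length is always a power of two
    len: usize,
}

/// FNV-1a 64-bit followed by the SplitMix64 finalizer.
fn hash(key: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in key {
        h = (h ^ b as u64).wrapping_mul(0x100000001b3);
    }
    h ^= h >> 30;
    h = h.wrapping_mul(0xbf58476d1ce4e5b9);
    h ^= h >> 27;
    h = h.wrapping_mul(0x94d049bb133111eb);
    h ^ (h >> 31)
}

impl HashTable {
    fn new(capacity_pow2: usize) -> Self {
        assert!(capacity_pow2.is_power_of_two());
        HashTable { slots: vec![None; capacity_pow2], len: 0 }
    }

    fn insert(&mut self, key: &[u8], value: &[u8]) -> bool {
        if (self.len + 1) as f64 > 0.85 * self.slots.len() as f64 {
            self.resize();
        }
        self.place(Slot { psl: 0, hash: hash(key), key: key.to_vec(), value: value.to_vec() })
    }

    /// Robin Hood probe loop. Returns false if an existing key was replaced.
    fn place(&mut self, mut cur: Slot) -> bool {
        let mask = self.slots.len() - 1;
        let mut i = cur.hash as usize & mask;
        loop {
            if self.slots[i].is_none() {
                self.slots[i] = Some(cur);
                self.len += 1;
                return true;
            }
            let s = self.slots[i].as_mut().unwrap();
            if s.hash == cur.hash && s.key == cur.key {
                s.value = cur.value;
                return false;
            }
            if s.psl < cur.psl {
                std::mem::swap(s, &mut cur); // steal from the rich, carry the evictee on
            }
            i = (i + 1) & mask;
            cur.psl += 1;
        }
    }

    fn find(&self, key: &[u8]) -> Option<usize> {
        let h = hash(key);
        let mask = self.slots.len() - 1;
        let (mut i, mut psl) = (h as usize & mask, 0u32);
        loop {
            let s = self.slots[i].as_ref()?; // empty slot: key absent
            if s.psl < psl {
                return None; // our key would have stolen this slot
            }
            if s.hash == h && s.key == key {
                return Some(i);
            }
            i = (i + 1) & mask;
            psl += 1;
        }
    }

    fn get(&self, key: &[u8]) -> Option<&[u8]> {
        let i = self.find(key)?;
        self.slots[i].as_ref().map(|s| s.value.as_slice())
    }

    /// Backward-shift deletion: pull successors left until a gap or a PSL of 0.
    fn remove(&mut self, key: &[u8]) -> bool {
        let Some(mut i) = self.find(key) else { return false };
        let mask = self.slots.len() - 1;
        loop {
            let j = (i + 1) & mask;
            if !matches!(&self.slots[j], Some(s) if s.psl > 0) {
                self.slots[i] = None;
                self.len -= 1;
                return true;
            }
            let mut moved = self.slots[j].take().unwrap();
            moved.psl -= 1;
            self.slots[i] = Some(moved);
            i = j;
        }
    }

    /// Doubles capacity and re-places every entry using its cached hash.
    fn resize(&mut self) {
        let new_cap = self.slots.len() * 2;
        let old = std::mem::replace(&mut self.slots, vec![None; new_cap]);
        self.len = 0;
        for mut s in old.into_iter().flatten() {
            s.psl = 0;
            self.place(s);
        }
    }
}
```

Routing resize through `place` instead of `insert` is what lets the cached hash pay off: no key is hashed a second time.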

Tests

#	Test	Pass if
T1	insert + get of 10k random keys	all hits
T2	insert 10k, remove 5k random, get remaining 5k	all still found, removed not found
T3	insert past 85% load triggers resize	capacity doubled
T4	duplicate insert replaces value, len unchanged	replacement worked
T5	max PSL ≤ 4·log₂(cap·load) at 1M inserts	bounded variance

Step 3 — Benchmark and Compare

Goal

Quantify the cost difference between the two structures on three workloads.

Workloads

Name	Op	Measured
`point`	get(key) where key was previously inserted	ns/op + cache misses (if `perf`)
`point-miss`	get(key) where key is absent	ns/op
`range`	iterate from `lo` until `hi`	ns/key (skip list only)

Procedure

Seed both structures with N (10k, 100k, 1M, 10M) random 8-byte keys + 8-byte values.
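For each workload, take iters random accesses and time them individually (or in small fixed-size batches) so percentiles are available; dividing the total loop time by iters gives only the mean.
Report p50, p99, mean, total throughput.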
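A timing harness for this procedure can be as small as the sketch below. It is one possible shape, not the `dsbench` implementation: it times every operation with `Instant`, uses nearest-rank percentiles, and the names `Stats`, `percentile`, and `measure` are hypothetical.

```rust
use std::time::Instant;

struct Stats {
    mean_ns: f64,
    p50_ns: u64,
    p99_ns: u64,
}

/// Nearest-rank percentile over an ascending-sorted, non-empty slice.
/// `pct` is a whole percentage in 1..=100; integer math avoids float rounding.
fn percentile(sorted: &[u64], pct: usize) -> u64 {
    let rank = (pct * sorted.len() + 99) / 100; // ceil(pct * n / 100)
    sorted[rank.clamp(1, sorted.len()) - 1]
}

/// Runs `op(i)` for i in 0..iters, timing each call separately.
fn measure<F: FnMut(usize)>(iters: usize, mut op: F) -> Stats {
    let mut samples = Vec::with_capacity(iters);
    for i in 0..iters {
        let t0 = Instant::now();
        op(i);
        samples.push(t0.elapsed().as_nanos() as u64);
    }
    samples.sort_unstable();
    let mean_ns = samples.iter().sum::<u64>() as f64 / iters as f64;
    Stats { mean_ns, p50_ns: percentile(&samples, 50), p99_ns: percentile(&samples, 99) }
}
```

Reading the clock costs tens of nanoseconds, which is the same order as a hash-table hit, so for the fastest operations it is worth timing batches of, say, 64 lookups per sample and dividing.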

Expected outcomes (M2 Pro, release, N=1M)

Op	Skip list	Hash table	Ratio
Insert	~700 ns	~95 ns	7×
Point hit	~450 ns	~25 ns	18×
Point miss	~500 ns	~30 ns	17×
Range scan 1000 from key	~25 µs	N/A	—
Bytes/entry	~80	~25	3.2×

What the numbers prove

For unordered point access, never use a skip list. The factor-of-20 gap is from cache-miss count: hash table touches ~1 line, skip list touches ~20 (one per level).
For ordered access, the skip list is the only option of the two. Range scans on a hash table require collecting all entries and sorting: O(n log n) setup, vs O(log n + k) for the skip list (seek to lo, then walk k entries).
The memory gap is real and gets worse for tiny values. Skip-list forward-pointer arrays dominate when value size is < ~64 B.

How to run

./target/release/dsbench bench point 1000000
./target/release/dsbench bench mem   1000000

Optional: cache-miss measurement

# Linux
perf stat -e cache-misses,cache-references ./target/release/dsbench bench point 1000000

# macOS (Instruments → Counters template, capture by PID)
xcrun xctrace record --template 'Counters' --launch -- ./target/release/dsbench bench point 1000000

Write-Ahead Log (WAL)

Status: complete — runnable in Rust, Go, C++.

1. What Is It

A WAL is an append-only file that records intent-to-modify before the actual data pages are updated. On crash, the recovery routine replays the WAL from the last checkpoint, restoring the database to a consistent state.

client write  ──►  append record  ──►  fsync(WAL)  ──►  ack client
                                                          │
                                                          ▼
                                                  later: apply to data file
                                                  later: checkpoint & truncate WAL

The critical invariant is the write order: WAL hits stable storage before the client is told the write committed. If the process dies between the fsync and the data-page update, recovery re-applies the logged operation. If it dies before the fsync, the client never got an ack, so losing the record is acceptable.

2. Why It Matters

Without WAL	With WAL
Random in-place writes to the data file	Sequential append (10–100× faster on HDD, still better on SSD)
Each commit = random-page fsync	Each commit = single sequential append + fsync
Crash mid-update ⇒ torn page, corrupt file	Crash ⇒ replay log, idempotent recovery
Group commit impossible	Multiple commits batched into one fsync ("group commit")

Every serious database has one: PostgreSQL's WAL, MySQL's redo log, SQLite's WAL mode, RocksDB's WAL, LevelDB's .log files (not the human-readable LOG info file), Kafka's segments.

3. How It Works

Record framing used in this lab (mirrors LevelDB's simplified record format minus the block-grouping):

 ┌─────────┬─────────┬──────────────────────┐
 │ len(u32)│ crc(u32)│ payload (len bytes)  │
 └─────────┴─────────┴──────────────────────┘
       4         4              N

len and crc are little-endian.
crc is CRC-32 (IEEE 802.3 polynomial, reflected) of the payload only.
Records are written back-to-back with no padding.
Recovery iterates from offset 0 and stops at the first record whose header is short, whose payload is short, or whose CRC fails. The valid prefix is replayed; the bad tail is silently truncated on next open.

file:
   ┌───┬───┬─────┬───┬───┬─────┬───┬───┬──┐  ← crashed mid-write
   │L=8│CRC│ A…  │L=4│CRC│ B…  │L=9│CRC│??│
   └───┴───┴─────┴───┴───┴─────┴───┴───┴──┘
                                ▲          ▲
                                │          └─ short payload → stop, truncate from here
                                └─ last fully-flushed record
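The framing and the stop rules translate almost line for line into code. A minimal in-memory sketch, assuming the format exactly as listed above (u32 LE length, u32 LE CRC-32 of the payload, no padding, len = 0 treated as end of log); `encode_record` and `recover` are illustrative names, not the lab's library API.

```rust
/// CRC-32 lookup table (IEEE 802.3 polynomial, reflected form 0xEDB88320).
fn crc32_table() -> [u32; 256] {
    let mut table = [0u32; 256];
    for (i, slot) in table.iter_mut().enumerate() {
        let mut c = i as u32;
        for _ in 0..8 {
            c = if c & 1 != 0 { (c >> 1) ^ 0xEDB8_8320 } else { c >> 1 };
        }
        *slot = c;
    }
    table
}

fn crc32(table: &[u32; 256], data: &[u8]) -> u32 {
    let mut c = 0xFFFF_FFFFu32;
    for &b in data {
        c = table[((c ^ b as u32) & 0xFF) as usize] ^ (c >> 8);
    }
    !c
}

fn read_u32_le(b: &[u8]) -> u32 {
    u32::from_le_bytes([b[0], b[1], b[2], b[3]])
}

/// Appends one framed record (len, crc, payload) to `out`.
fn encode_record(table: &[u32; 256], payload: &[u8], out: &mut Vec<u8>) {
    assert!(!payload.is_empty(), "len == 0 is reserved as the EOF sentinel");
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&crc32(table, payload).to_le_bytes());
    out.extend_from_slice(payload);
}

/// Returns the valid records and the byte length of the valid prefix.
/// Stops at a short header, a zero length, a short payload, or a CRC mismatch.
fn recover<'a>(table: &[u32; 256], log: &'a [u8]) -> (Vec<&'a [u8]>, usize) {
    let mut records = Vec::new();
    let mut pos = 0;
    while log.len() - pos >= 8 {
        let len = read_u32_le(&log[pos..]) as usize;
        let crc = read_u32_le(&log[pos + 4..]);
        if len == 0 || log.len() - pos - 8 < len {
            break;
        }
        let payload = &log[pos + 8..pos + 8 + len];
        if crc32(table, payload) != crc {
            break;
        }
        records.push(payload);
        pos += 8 + len;
    }
    (records, pos)
}
```

The second return value is the offset a reopen would truncate to; everything after it is the bad tail.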

4. Core Terminology

Term	Definition
Record	Self-describing unit: header (len + crc) followed by payload bytes.
`fsync`	Syscall asking the kernel to flush dirty pages and inode metadata to disk.
`fdatasync`	Like `fsync` but skips metadata if only data changed. Slightly faster.
Group commit	Coalescing multiple in-flight appends into one shared `fsync`.
Torn write	A write the device split into two physical sectors, only one of which made it. CRC catches this.
Tail truncation	Scanning forward at open and discarding any partial trailing record.
Checkpoint	Flush dirty pages to data file, record a WAL position beyond which replay is unnecessary.

5. Mental Models

The log is the source of truth, the data file is the cache. Recovery reconstructs from the log.
CRC is for detection, not correction. It tells you where the good prefix ends; it does not heal damage.
fsync is a barrier, not a universal durability guarantee. Consumer SSDs and FUSE filesystems sometimes lie. Use fio --fdatasync=1 to spot-check hardware.
Sequential I/O wins. Even on SSDs, sequential writes have better SLC-cache and GC behavior than random ones.

6. Common Misconceptions

"write() already put it on disk." No — kernel page cache. fsync is required for durability.
"CRC + length is enough." Necessary, not sufficient. A record with both len and crc zeroed is indistinguishable from len=0,crc=crc32([]). We disallow len=0 (treat as EOF).
"Group commit hurts latency." Tiny median bump for a 10–100× throughput win and lower tail latency under load.
"fsync == O_DIRECT." Different layers. O_DIRECT bypasses the page cache; fsync flushes it.

7. Interview Talking Points

Distinguish redo log (PostgreSQL/InnoDB), WAL mode (SQLite WAL), journal mode (SQLite default).
Why CRC the payload, not the header? (Need length first; header CRC catches the wrong failures.)
fsync vs fdatasync vs sync_file_range.
Group commit mechanics and tradeoff.
Why WAL alone doesn't beat torn writes on the data file → full-page writes in WAL after each checkpoint.

8. Connections to Other Labs

db-01 — every append here is a pwrite + fsync from db-01.
db-05, db-09 — LSM writes always hit the WAL first.
db-11 — SQLite WAL mode reuses this exact pattern around a B-tree pager.
db-13 — commit records & 2PC live in the WAL.
db-17 — Raft's replicated log is, mechanically, a WAL.

References — Write-Ahead Log

Papers

ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging. C. Mohan et al., 1992. The canonical WAL paper; introduces the redo–undo discipline still used everywhere.
The Log-Structured Merge-Tree (LSM-Tree). P. O'Neil et al., 1996. Section 2 motivates why a separate sequential log is necessary even for in-memory writes.

Code

LevelDB db/log_format.h — record types & block structure that inspired this lab's framing.
RocksDB db/log_writer.cc — production-grade group-commit implementation.
PostgreSQL src/backend/access/transam/xlog.c — full-page writes, redo machinery.

CRC

A Painless Guide to CRC Error Detection Algorithms. Ross Williams, 1993. Plain-English walk-through of polynomial division and reflected algorithms.
Linux kernel lib/crc32.c — reference table-driven implementation.

Filesystems & fsync

Can Applications Recover from fsync Failures? Anthony Rebello et al., USENIX ATC 2020. Surveys the depressing landscape of partial fsync failures.
Files are hard. Dan Luu, 2017. Survey blog post on every way fsync, rename, and friends betray you.
man 2 fsync, man 2 fdatasync, man 2 open (O_DSYNC, O_SYNC).

Analysis — Write-Ahead Log

Problem statement

Make a stream of writes durable and crash-recoverable without paying for a random in-place fsync per write.

Constraints

Constraint	Why it matters
Append-only file	Sequential I/O, ~10–100× faster than random on HDD; better SLC/GC behavior on SSD.
Self-describing records	Recovery must work without a side index. Header = length + checksum.
Truncation-tolerant tail	Crash mid-write leaves a partial record. We must detect & ignore it on next open.
Single writer	We do not address multi-writer log multiplexing here (Kafka does).
No structural guarantees from the FS	Don't assume ordering of metadata vs data, or that 4KB writes are atomic.

Design decisions

Header = len(u32 LE) || crc32(u32 LE). Small (8 B), aligned, endian-fixed. We deliberately keep the length out of the CRC so we can stream-checksum the body.
CRC is over the payload only. The header itself is implicitly validated by use — a corrupt len either points beyond EOF (short read) or to data whose CRC won't match.
len == 0 is disallowed, used as the EOF sentinel. Empty payloads are rare in practice and avoiding the ambiguity simplifies the reader (len=0,crc=0 happens naturally in a hole of zeros from a sparse file or pre-allocated extent).
Little-endian on disk. Everyone runs LE now (x86, ARM, RISC-V; even POWER prefers it). Skipping the htole32 dance saves ~5 LOC per language.
CRC table generated at startup, not hardcoded. 1 KB, computed in microseconds. Easier to audit, and lets us swap polynomials in tests.
One file, one writer, one fd. No segment rotation in this lab — that lives in db-07 (compaction) and db-09 (LevelDB). Single-file WAL is enough to teach framing.
sync() is a separate method. The caller decides commit boundaries. Production systems may add append_sync(payload) that batches a group commit; we leave that for bench mode.

Why this design over alternatives

vs LevelDB's block-grouped framing: LevelDB divides the log into 32 KiB blocks and fragments any record that crosses a block boundary (FULL/FIRST/MIDDLE/LAST record types), which confines corruption to one block. Beautiful, but doubles the code volume. We follow this lab's bias of "minimum to teach the concept, plus one cross-language cross-check".
vs JSON / protobuf framing: would require schema management. CRC + raw bytes is the smallest possible recoverable framing.
vs a per-record fsync: we expose a separate sync() so the user can choose between durability per-record (call after every append) and group commit (call periodically).

Failure modes addressed

Failure	Detection
Partial header at EOF	Header read < 8 bytes ⇒ stop iteration.
Header OK, partial payload	Payload read < len ⇒ stop iteration.
Full record, CRC mismatch (bit-flip)	CRC32 over payload ≠ stored CRC ⇒ stop iteration.
Hole of zeros (sparse FS, preallocated)	`len == 0` is the EOF sentinel ⇒ stop iteration cleanly.
Disk fully lying about `fsync`	Out of scope; mention `fio --fdatasync=1` to detect.

Failure modes NOT addressed in this lab

Bit-flip in the header itself that produces a plausible (len, crc) pair (probability ≈ 2⁻³²). Production systems mitigate with a record-type byte (LevelDB) or magic bytes (Kafka).
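Multi-process writers. O_APPEND makes each write() land at the current end of file, but the PIPE_BUF atomicity guarantee covers pipes and FIFOs, not regular files, so concurrent appenders still need a lock or a single writer process; see db-09 / db-21.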
Disk full mid-write. Treat as torn write at EOF (the trailing record fails CRC and is dropped on recovery); the caller's append() returns an I/O error that they must handle.

Execution — How to Build and Run

Quick start (per language)

# Rust
cd src/rust
cargo build --release
cargo test --release
./target/release/walbench --help

# Go
cd src/go
go test ./...
go build -o bin/walbench ./cmd/walbench
./bin/walbench --help

# C++
cd src/cpp
cmake -S . -B build && cmake --build build
ctest --test-dir build
./build/walbench --help

CLI: `walbench`

A single binary per language exercising the WAL.

Subcommand	Args	What it does
`append PATH N [SIZE]`	path, count, payload bytes (default 64)	Appends N records of SIZE bytes; reports bytes/sec
`append-sync PATH N [SIZE]`	same	Same as `append` but `sync()` after each record
`read PATH`	path	Replays the log, prints `len(crc=…) ok` per record, then `OK n=… bytes=…`
`corrupt PATH OFFSET BYTE`	path, offset, byte value	Overwrites one byte in place — for testing tail tolerance
`bench-group PATH N BATCH`	path, total records, batch size	Group-commit benchmark: sync once per BATCH records

Library API

Same shape in all three languages.

Wal::open(path)            -> Wal           // creates or opens for append, scans to EOF
Wal::append(payload)       -> u64 offset    // record start offset in file
Wal::sync()                -> ()            // fdatasync (or fsync) the file
Wal::len()                 -> u64           // bytes on disk (post-append, post-sync)
Wal::close()               -> ()            // implicit on Drop in Rust / RAII in C++

Wal::iter(path)            -> iterator<Vec<u8>>   // streams records; stops at first bad/short

open scans forward at startup to (a) find true EOF after a partial-write recovery and (b) optionally truncate the file to that position so the next append doesn't append after a known-bad tail. We do truncate in this implementation — the alternative (leave the bad tail in place) makes file size useless and complicates len().
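The scan-then-truncate step is short enough to show end to end. This is a sketch of the behavior described above, not the lab's `Wal::open`: it reads the whole file into memory (fine for a teaching log, not for a multi-gigabyte one), uses a bitwise CRC-32 to stay compact, and `open_for_append`, `valid_prefix_len`, and `append` are hypothetical names.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Read, Seek, SeekFrom, Write};
use std::path::Path;

/// Bitwise CRC-32 (IEEE, reflected). Slower than a table, but short.
fn crc32(data: &[u8]) -> u32 {
    let mut c = 0xFFFF_FFFFu32;
    for &b in data {
        c ^= b as u32;
        for _ in 0..8 {
            c = if c & 1 != 0 { (c >> 1) ^ 0xEDB8_8320 } else { c >> 1 };
        }
    }
    !c
}

fn read_u32_le(b: &[u8]) -> u32 {
    u32::from_le_bytes([b[0], b[1], b[2], b[3]])
}

/// Length in bytes of the longest prefix made of whole, CRC-valid records.
fn valid_prefix_len(log: &[u8]) -> usize {
    let mut pos = 0;
    while log.len() - pos >= 8 {
        let len = read_u32_le(&log[pos..]) as usize;
        let crc = read_u32_le(&log[pos + 4..]);
        if len == 0 || log.len() - pos - 8 < len {
            break;
        }
        if crc32(&log[pos + 8..pos + 8 + len]) != crc {
            break;
        }
        pos += 8 + len;
    }
    pos
}

/// Opens (or creates) the log, drops any bad tail, and positions at true EOF.
/// Returns the file and the offset where the next record will start.
fn open_for_append(path: &Path) -> io::Result<(File, u64)> {
    let mut file = OpenOptions::new().read(true).write(true).create(true).open(path)?;
    let mut bytes = Vec::new();
    file.read_to_end(&mut bytes)?;
    let valid = valid_prefix_len(&bytes) as u64;
    if valid < bytes.len() as u64 {
        file.set_len(valid)?; // discard the torn tail
        file.sync_all()?;     // file length is metadata, so use a full sync
    }
    file.seek(SeekFrom::Start(valid))?;
    Ok((file, valid))
}

/// Writes one framed record at the current position (no sync).
fn append(file: &mut File, payload: &[u8]) -> io::Result<()> {
    let mut rec = Vec::with_capacity(8 + payload.len());
    rec.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    rec.extend_from_slice(&crc32(payload).to_le_bytes());
    rec.extend_from_slice(payload);
    file.write_all(&rec)
}
```

Truncating before the first append is what guarantees that a new record never lands behind bytes the reader will refuse to cross.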

Verifying

./scripts/verify.sh        # invariants per implementation
./scripts/cross_test.sh    # write in lang A, read in lang B, all six pairs

Observation — Looking at the Bytes

1. Hexdump a freshly written WAL

./build/walbench append /tmp/wal 3 4
xxd /tmp/wal

00000000: 0400 0000 b3ca 9eb5 4141 4141 0400 0000  ........AAAA....
00000010: b3ca 9eb5 4242 4242 0400 0000 b3ca 9eb5  ....BBBB........
00000020: 4343 4343                                CCCC

What you should see:

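Three 12-byte records (8-byte header + 4-byte payload each), 36 bytes total.
The len field is identical in all three headers because every payload is 4 bytes. The CRC field must differ from record to record: CRC-32 maps distinct 4-byte payloads to distinct values, so treat the repeated b3ca 9eb5 in the sample as a placeholder. If your real dump shows the same CRC three times, the checksum is being computed over the wrong bytes.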
04 00 00 00 is len = 4 in little-endian.
The next 4 bytes are the payload's CRC, also little-endian.

If your file is suspiciously large (e.g., starts with garbage 0x00 or 0xFF runs), open() is opening the file with the wrong flags or your buffer is uninitialized.

2. Group commit vs per-record sync

./build/walbench append-sync /tmp/wal 10000 64    # fsync per record
./build/walbench bench-group  /tmp/wal 10000 1    # batch=1, same thing
./build/walbench bench-group  /tmp/wal 10000 64   # 64 records per fsync
./build/walbench bench-group  /tmp/wal 10000 512  # 512 per fsync

Sample (M2 Pro, APFS):

mode             throughput
per-record sync     1,800 records/s   (~556 µs/sync)
group=64          110,000 records/s
group=512         260,000 records/s

Two takeaways: per-record sync is roughly two orders of magnitude slower (1,800 vs 260,000 records/s, about 144×); group size has diminishing returns past ~256 because the bottleneck shifts to write() itself.

3. Tail truncation in action

./build/walbench append /tmp/wal 5 16
wc -c /tmp/wal                       # 120 bytes (5 × 24)
printf '\xff\xff\xff\xff' >> /tmp/wal
wc -c /tmp/wal                       # 124 bytes
./build/walbench read /tmp/wal       # reads 5, then "stop: short header" (124-120 = 4 < 8)
./build/walbench append /tmp/wal 1 16
wc -c /tmp/wal                       # 144 bytes — open() truncated the garbage, then appended

The reopen-truncate behavior is the most easily-missed correctness detail. If it's broken, your next append lands after the garbage tail; the reader stops at the garbage, so every record written after recovery is silently unreachable.

4. CRC sensitivity

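Overwriting one byte of a 64-byte payload should break the CRC of that record but leave everything before it valid. Records here are 72 bytes each (8 header + 64 payload), so offset 100 falls inside the payload of the second record (bytes 72..143):

./build/walbench append /tmp/wal 10 64
./build/walbench corrupt /tmp/wal 100 0x00     # mid-payload of the second record
./build/walbench read /tmp/wal | tail
# expected: prints 1 ok record, then "stop: bad crc" (assuming the original byte was not already 0x00)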

What "working" looks like

Hexdump shows tightly packed 8-byte-header + payload pairs, no padding.
Group commit is at least 50× faster than per-record sync.
Tail truncation works on first reopen, regardless of how much garbage you appended.

What "broken" looks like

A reader that hangs or panics on garbage — fix the bounds checks in the iter loop.
File size grows but throughput is flat — you're probably calling fsync inside append accidentally.
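CRC doesn't trip on single-bit flips: the reader is not actually comparing the stored and computed CRC, or it checksums the wrong byte range. Every CRC-32 variant detects any single-bit error, so a wrong polynomial (for example the un-reflected form) shows up as known-vector or cross-language failures instead (see scripts/verify.sh).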
Cross-language test fails — endianness or CRC table bug. Print the first 16 bytes of the file from each language and compare.

Verification — What to Test and How

Property tests (per language)

#	Test	Pass if
V1	`crc32_known_vectors`	`""` → 0x00000000; `"a"` → 0xE8B7BE43; `"123456789"` → 0xCBF43926
V2	`roundtrip_small`	append "hello" "world", iter yields exactly those two payloads
V3	`roundtrip_1000_variable`	append 1000 records of pseudo-random sizes 1..1024, iter yields identical sequence
V4	`truncated_tail`	open, append A and B, fsync, write 5 bytes of garbage past EOF, reopen ⇒ iter yields {A,B} only
V5	`corrupt_payload`	flip one bit in the payload of record 2 of 5, iter yields {1} (stops at first bad)
V6	`corrupt_header`	overwrite len of record 2 with 0xFFFFFFFF, iter yields {1}
V7	`reopen_truncates_garbage`	scenario V4 followed by a new append, total iter yields {A,B,C} and file size equals exactly the three records' total bytes
V8	`append_returns_offset`	offset returned by appendₙ equals sum of (header+payload) for records 0..n-1

Cross-language test

scripts/cross_test.sh performs a 3 × 3 matrix (nine runs): for each writer ∈ {go, rust, cpp} and reader ∈ {go, rust, cpp}, write 500 records of varying sizes with a fixed seed in the writer language, read them in the reader language, assert the payload list matches exactly.

This catches:

Endianness mistakes in len/crc.
Different CRC polynomials or initial value / final XOR.
Off-by-one in header parse.
fsync not being called before the reader runs (we close the writer between phases).

Manual smoke

./build/walbench append /tmp/wal 100 64
./build/walbench read /tmp/wal | tail
# expected: OK n=100 bytes=7200   (100 × 72-byte records, headers included)
./build/walbench corrupt /tmp/wal 50 0xFF
./build/walbench read /tmp/wal | tail
# expected: offset 50 lies in the first record's payload, so the reader reports a bad record before yielding anything

What "passing" means

All 8 property tests green in all three languages.
cross_test.sh exits 0 (9 successful writer×reader runs).
Manual smoke: corruption stops the reader cleanly, no panic / no segfault, no infinite loop.

Broader Ideas — Beyond the Minimum

Things worth knowing that aren't in the lab code.

Block-grouped framing (LevelDB / RocksDB)

LevelDB pads records into 32 KB blocks and uses a 1-byte type field (FULL, FIRST, MIDDLE, LAST) to handle records that straddle blocks. Benefit: corruption in one block can't propagate; recovery can resync to the next block boundary. Cost: more code, slightly more space.

Group commit, properly

Real systems run a "log writer" goroutine/thread:

clients ──► append to buffer ──► wake writer ──► fsync once ──► broadcast cond var

The writer batches all records that arrived during the previous fsync into the next fsync. Latency stays bounded by (max fsync time) + (one batch fill); throughput scales until you saturate the SSD's IOPS.
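
A minimal Go sketch of the pipeline above, under stated assumptions: GroupCommitter, writeFn, syncFn, and runDemo are hypothetical names, the write and sync steps are fakes (a counter and a 1 ms sleep), and per-request channels stand in for the condition-variable broadcast. It is meant to show the batching shape, not the lab's implementation.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// request is one client append waiting for durability.
type request struct {
	payload []byte
	done    chan error // receives the result of the sync that covered this record
}

// GroupCommitter is a hypothetical log writer: one goroutine owns the file.
type GroupCommitter struct {
	reqs    chan request
	writeFn func([]byte) error // stand-in for Wal.append
	syncFn  func() error       // stand-in for Wal.sync
}

// Append blocks until the record has been covered by a sync.
func (g *GroupCommitter) Append(p []byte) error {
	done := make(chan error, 1)
	g.reqs <- request{payload: p, done: done}
	return <-done
}

// run batches every request that queued up while the previous sync was running.
func (g *GroupCommitter) run() {
	for first := range g.reqs {
		batch := []request{first}
	drain:
		for {
			select {
			case r, ok := <-g.reqs:
				if !ok {
					break drain
				}
				batch = append(batch, r)
			default:
				break drain
			}
		}
		var err error
		for _, r := range batch {
			if err = g.writeFn(r.payload); err != nil {
				break
			}
		}
		if err == nil {
			err = g.syncFn() // one sync for the whole batch
		}
		for _, r := range batch {
			r.done <- err // wake every waiter in the batch
		}
	}
}

// runDemo pushes `clients` concurrent appends through the fake 1 ms sync.
func runDemo(clients int) (records, syncs int) {
	g := &GroupCommitter{reqs: make(chan request, clients)}
	g.writeFn = func([]byte) error { records++; return nil }
	g.syncFn = func() error { syncs++; time.Sleep(time.Millisecond); return nil }
	stopped := make(chan struct{})
	go func() { g.run(); close(stopped) }()

	var wg sync.WaitGroup
	for i := 0; i < clients; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := g.Append([]byte("payload")); err != nil {
				panic(err)
			}
		}()
	}
	wg.Wait()
	close(g.reqs)
	<-stopped
	return records, syncs
}

func main() {
	records, syncs := runDemo(200)
	fmt.Printf("records=%d syncs=%d\n", records, syncs)
}
```

The sync count depends on scheduling, but it should come out well below the record count whenever clients arrive faster than one sync completes.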

`O_DSYNC` vs application-level `fsync`

O_DSYNC makes every write() durable before returning. Removes the need for explicit fsync, but you lose the chance to batch. Real DBs prefer explicit fsync for that reason.

`sync_file_range` and friends

Linux-only. sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE) starts writeback for only a byte range. PostgreSQL uses this for "lazy" checkpoints to avoid stalling on huge fsyncs. It does not sync metadata and gives no durability guarantee on its own, so you still need a final fsync.

Pre-allocation & fallocate

For predictable I/O, pre-allocate the next WAL segment with fallocate(FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE). This avoids metadata updates on each grow and gives the FS a contiguous extent. PostgreSQL pre-zeroes 16 MB segments.

Direct I/O & alignment

O_DIRECT bypasses the page cache; useful when the DB has its own buffer pool. It requires buffers, offsets, and lengths aligned to the device's logical block size (typically 512 B or 4 KB). Modern recommendation: prefer io_uring + O_DIRECT over POSIX AIO. We return to this in db-21.

Mixing data files and WAL on the same disk

Bad idea for HDDs (head contention), neutral for SSDs (no head), bad for low-end SSDs (write amplification competes). Production systems put WAL on a separate device when latency-sensitive.

When the WAL is the database

LSM-trees, Kafka, NATS JetStream, Pulsar, Apache BookKeeper — these treat the log as the primary structure and let secondary indexes / merge trees / consumers catch up. The data file in our toy example was hypothetical; LSMs make it explicit. See db-05 onward.

Encryption / compression

Compression per record: trivial, but blocks the Vec<u8> reuse pattern. Better to compress whole segments at checkpoint time.
Encryption per record: AEAD (AES-GCM or ChaCha20-Poly1305) replaces CRC32 — the auth tag is your CRC. PostgreSQL's TDE proposals use this.

Replication

Once you have a sequential log of operations, replicating it is "just" send-and-replay. This is the entire conceptual basis of Raft and ZAB — see db-17 / db-19. The framing tricks here transfer directly.

What goes wrong at scale

fsync amplification: every fsync touches the FS journal, which serializes against other fsyncs. Solution: large group commit batches.
Long fsync tails: 99th-percentile fsync on a busy NVMe can be 100ms+. Solution: pipeline; never block a hot-path thread on fsync.
Filesystems that lie: ext4 with data=writeback may complete fsync before journaling. APFS, ZFS, btrfs each have their own quirks. Empirical test with fio is the only safe answer.

Step 1 — Record framing & CRC

Goal

Define the on-disk format and a streaming CRC32 implementation that matches between Rust, Go, and C++.

Format recap

 ┌─────────┬─────────┬──────────────────────┐
 │ len(u32)│ crc(u32)│ payload (len bytes)  │
 └─────────┴─────────┴──────────────────────┘
       4         4              N

Both u32 fields are little-endian.
CRC is over the payload only.
len == 0 is the EOF sentinel (an empty payload cannot be appended).
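
As a hedged illustration of the format above, here is a small Go sketch that frames and parses one record in memory. The names encodeRecord, decodeRecord, and headerSize are hypothetical, and it leans on the standard library's hash/crc32 instead of the hand-written table the lab asks for.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

const headerSize = 8 // len (u32 LE) + crc (u32 LE)

// encodeRecord frames one payload as len || crc || payload.
func encodeRecord(payload []byte) ([]byte, error) {
	if len(payload) == 0 {
		return nil, errors.New("empty payload is reserved as the EOF sentinel")
	}
	buf := make([]byte, headerSize+len(payload))
	binary.LittleEndian.PutUint32(buf[0:4], uint32(len(payload)))
	binary.LittleEndian.PutUint32(buf[4:8], crc32.ChecksumIEEE(payload))
	copy(buf[headerSize:], payload)
	return buf, nil
}

// decodeRecord parses one frame from the front of buf and returns the
// payload plus the number of bytes consumed.
func decodeRecord(buf []byte) (payload []byte, n int, err error) {
	if len(buf) < headerSize {
		return nil, 0, errors.New("short header")
	}
	length := binary.LittleEndian.Uint32(buf[0:4])
	want := binary.LittleEndian.Uint32(buf[4:8])
	if length == 0 {
		return nil, 0, errors.New("eof sentinel")
	}
	if uint64(len(buf)-headerSize) < uint64(length) {
		return nil, 0, errors.New("short payload")
	}
	end := headerSize + int(length)
	payload = buf[headerSize:end]
	if crc32.ChecksumIEEE(payload) != want {
		return nil, 0, errors.New("bad crc")
	}
	return payload, end, nil
}

func main() {
	rec, _ := encodeRecord([]byte("AAAA"))
	fmt.Printf("% x\n", rec) // 12 bytes: 4 len + 4 crc + 4 payload
	p, n, err := decodeRecord(rec)
	fmt.Println(string(p), n, err)
}
```

The CRC covers the payload only, so a corrupted len field is caught by the short-payload check or by the CRC of the wrong byte range, not by a header checksum.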

CRC32 — table-driven, reflected

poly = 0xEDB88320                        // reflected IEEE 802.3 polynomial
table[256]: built once at startup
crc = 0xFFFFFFFF                         // initial value before processing
for each input byte b:
    crc = (crc >> 8) ^ table[(crc & 0xff) ^ b]
return crc ^ 0xFFFFFFFF                  // final XOR

Known-answer vectors

input	CRC32 hex
`""`	`0x00000000`
`"a"`	`0xE8B7BE43`
`"123456789"`	`0xCBF43926`

Pin these in every language's unit tests. They are the canonical CRC-32/IEEE vectors used by zlib, gzip, PNG, and Ethernet. (LevelDB's own log uses CRC-32C, the Castagnoli polynomial, so its checksums will not match these.)

Rust outline

// TABLE is the 256-entry lookup table built once from poly 0xEDB88320.
pub fn crc32_ieee(bytes: &[u8]) -> u32 {
    let mut c: u32 = 0xFFFF_FFFF;
    for &b in bytes {
        c = (c >> 8) ^ TABLE[((c & 0xff) ^ b as u32) as usize];
    }
    c ^ 0xFFFF_FFFF
}

Go outline

func Crc32IEEE(b []byte) uint32 {
    c := uint32(0xFFFFFFFF)
    for _, x := range b {
        c = (c >> 8) ^ table[byte(c)^x]
    }
    return c ^ 0xFFFFFFFF
}
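
The Go outline refers to a table that is never shown. A possible way to build it, sketched under the assumption that a package-level array initialised by a helper (here the hypothetical makeTable) is acceptable, is the standard bit-at-a-time construction for the reflected polynomial. The main function cross-checks the result against hash/crc32.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

const poly = 0xEDB88320 // reflected IEEE 802.3 polynomial

var table = makeTable()

// makeTable computes the CRC of every possible byte value, one bit at a time.
func makeTable() [256]uint32 {
	var t [256]uint32
	for i := range t {
		c := uint32(i)
		for bit := 0; bit < 8; bit++ {
			if c&1 == 1 {
				c = (c >> 1) ^ poly
			} else {
				c >>= 1
			}
		}
		t[i] = c
	}
	return t
}

// Crc32IEEE is the table-driven loop from the outline above.
func Crc32IEEE(b []byte) uint32 {
	c := uint32(0xFFFFFFFF)
	for _, x := range b {
		c = (c >> 8) ^ table[byte(c)^x]
	}
	return c ^ 0xFFFFFFFF
}

func main() {
	for _, s := range []string{"", "a", "123456789"} {
		got := Crc32IEEE([]byte(s))
		fmt.Printf("%q -> 0x%08X (stdlib agrees: %v)\n",
			s, got, got == crc32.ChecksumIEEE([]byte(s)))
	}
}
```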

C++ outline

inline std::uint32_t Crc32Ieee(std::span<const std::uint8_t> b) noexcept {
    std::uint32_t c = 0xFFFFFFFFu;
    for (auto x : b) c = (c >> 8) ^ kTable[(c & 0xff) ^ x];
    return c ^ 0xFFFFFFFFu;
}

Trap: which CRC?

There are at least eight in common use. IEEE (reflected, init 0xFFFFFFFF, final XOR 0xFFFFFFFF) is what we want. 0x04C11DB7 un-reflected is not the same value despite being the same polynomial.

If your test gives 0xC1D04330 for "a" (or 0xE3069283 for "123456789"), you're using CRC-32C (Castagnoli). Different polynomial, different table.
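
One quick way to tell the two variants apart is to print both for the same inputs. The sketch below uses Go's hash/crc32, which ships both polynomials; the helper names ieee and castagnoli are made up for this example.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// ieee is the variant this lab wants (zlib, gzip, Ethernet).
func ieee(b []byte) uint32 { return crc32.ChecksumIEEE(b) }

// castagnoli is CRC-32C: same algorithm shape, different polynomial.
func castagnoli(b []byte) uint32 {
	return crc32.Checksum(b, crc32.MakeTable(crc32.Castagnoli))
}

func main() {
	for _, s := range []string{"a", "123456789"} {
		fmt.Printf("%-12q IEEE=0x%08X  CRC-32C=0x%08X\n",
			s, ieee([]byte(s)), castagnoli([]byte(s)))
	}
}
```

If your own implementation matches the second column, the table was generated from the wrong polynomial.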

Step 2 — Append, sync, iterate

Goal

Implement Wal::open / append / sync / iter consistently in all three languages.

API recap

open(path)        -> Wal      // O_RDWR | O_CREAT, scan-and-truncate the tail
append(payload)  -> u64       // returns the record's start offset
sync()           -> ()        // fdatasync (or fsync on platforms without it)
len()            -> u64       // bytes in the live valid region
iter(path)       -> Iterator  // yields each payload until first short/bad record

`open` — scan & truncate

The crucial subroutine. After a crash, the file may end in a partial header or partial payload. open finds the last valid record's end and truncates the file to that length, so subsequent appends append cleanly.

pos = 0
loop:
    if file_size - pos < 8: break              // not enough for header
    read 8 bytes at pos: (len, crc)
    if len == 0: break                          // EOF sentinel / sparse hole
    if pos + 8 + len > file_size: break         // payload short
    read len bytes at pos+8
    if crc32(payload) != crc: break
    pos += 8 + len
if pos != file_size:
    ftruncate(file, pos)
return Wal { fd, write_offset = pos }
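
The scan loop can be exercised without touching the filesystem. The Go sketch below runs it over an in-memory byte slice; validPrefix and frame are hypothetical helper names, and hash/crc32 stands in for the lab's own CRC. A real open would follow the scan with a truncate call on the file (os.File.Truncate in Go).

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// validPrefix returns the offset just past the last intact record in data.
func validPrefix(data []byte) int {
	pos := 0
	for {
		if len(data)-pos < 8 {
			break // not enough bytes for a header
		}
		length := uint64(binary.LittleEndian.Uint32(data[pos:]))
		crc := binary.LittleEndian.Uint32(data[pos+4:])
		if length == 0 {
			break // EOF sentinel or zero-filled hole
		}
		if uint64(len(data)-pos-8) < length {
			break // payload cut short
		}
		payload := data[pos+8 : pos+8+int(length)]
		if crc32.ChecksumIEEE(payload) != crc {
			break // corrupted record
		}
		pos += 8 + int(length)
	}
	return pos
}

// frame builds one len || crc || payload record.
func frame(p []byte) []byte {
	buf := make([]byte, 8+len(p))
	binary.LittleEndian.PutUint32(buf[0:4], uint32(len(p)))
	binary.LittleEndian.PutUint32(buf[4:8], crc32.ChecksumIEEE(p))
	copy(buf[8:], p)
	return buf
}

func main() {
	var wal []byte
	for i := 0; i < 5; i++ {
		wal = append(wal, frame([]byte("0123456789abcdef"))...)
	}
	wal = append(wal, 0xff, 0xff, 0xff, 0xff) // simulated torn tail
	fmt.Println("file size:", len(wal), "valid prefix:", validPrefix(wal))
}
```

This mirrors the tail-truncation walkthrough earlier: five 24-byte records plus four garbage bytes give a 124-byte file with a 120-byte valid prefix.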

`append`

hdr[0..4] = len.to_le_bytes()
hdr[4..8] = crc32(payload).to_le_bytes()
pwrite(fd, hdr,     write_offset)
pwrite(fd, payload, write_offset + 8)
offset_returned = write_offset
write_offset += 8 + len
return offset_returned

We do not fsync inside append. Callers do that explicitly via sync() to enable group commit.

`sync`

Linux: fdatasync(fd)
macOS: fcntl(fd, F_FULLFSYNC, 0) for true device-level sync; falls back to fsync(fd) if F_FULLFSYNC fails (e.g., not on APFS).
Windows: FlushFileBuffers(handle) (out of scope here).

In this lab we use fdatasync (Linux) and fsync (macOS) for simplicity; production should consider F_FULLFSYNC on macOS because plain fsync does not guarantee device-level durability on Apple's filesystems.

`iter` — read-only replay

Mirrors open's scan loop but yields each payload instead of advancing a write cursor. Stops on the same conditions (len == 0, short header, short payload, bad CRC). Never panics on garbage.

Tests to pin behavior

#	Test	Expected
T1	Append "A", "B", reopen, iter → ["A", "B"]	Both records returned in order
T2	Append, truncate WAL by 1 byte (cut payload), reopen, iter	Last record dropped
T3	Append, flip a payload byte, iter	Reader stops at bad CRC
T4	Append, write `\0\0\0\0\0\0\0\0` past EOF, reopen	File length restored to pre-garbage size
T5	append() returned offsets are strictly increasing and each equals the file size just before that append	Yes

Gotchas

macOS fsync does not flush the disk write cache. Use F_FULLFSYNC for tests that must outlive a power loss.
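Rust std::fs::File is unbuffered: write_all hands the bytes to the kernel with write(2) but never makes them durable, and wrapping the file in a BufWriter adds a userspace buffer that must be flushed before sync. We use positional writes via std::os::unix::fs::FileExt::write_all_at (or pwrite from nix) and no BufWriter.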
Go os.File.Write is unbuffered by default, but bufio.Writer is not. Make sure your Wal does not wrap the file in a bufio.Writer — that defers writes invisibly and confuses sync.

Step 3 — Group commit benchmark

Goal

Quantify the cost of fsync and the throughput win from group commit.

Workload

bench-group PATH N BATCH:

for i in 0..N:
    append(payload)
    if (i+1) % BATCH == 0: sync()
sync()   // final

PATH is a brand-new file each run. N = 50_000 is a good starting point on a modern SSD.

Numbers to look for (M2 Pro, APFS, 64-byte payload)

Batch	Throughput	Avg latency / sync	Bytes flushed / sync
1	~1,800 rec/s	~560 µs	~72 B
8	~12,000 rec/s	~670 µs	~576 B
64	~110,000 rec/s	~580 µs	~4.6 KB
512	~260,000 rec/s	~1.0 ms	~37 KB
4096	~310,000 rec/s	~13 ms	~295 KB

Two effects worth noting:

Sync time is roughly constant up to ~4KB: the bottleneck is the per-fsync overhead (syscall + journal commit), not the byte count.
Returns diminish past batch ~256: bandwidth becomes the next limit. Past ~4096 you start hitting tail-latency cliffs.

What "broken" looks like

Per-record throughput is the same as group=64: your sync() isn't doing anything (no-op, wrong fd, or bufio.Writer swallowing the write).
Throughput keeps climbing past group=4096: you may not be calling sync() at all between batches.
macOS numbers look impossibly fast: plain fsync does not flush the device cache. Re-run with F_FULLFSYNC to compare.

Comparing to a Linux box

On NVMe + ext4:

Batch	Throughput
1	~3,000 rec/s
64	~180,000 rec/s
4096	~600,000 rec/s

The shape is identical; absolute numbers depend on the device's flush latency.

Bloom Filters and Hashing

Status: complete — runnable in Rust, Go, C++.

1. What Is It

A Bloom filter is a probabilistic set: add(x) always succeeds; contains(x) returns either definitely not present or probably present. It uses a fixed-size bit array of m bits and k independent hash functions; add(x) sets bits at positions h_1(x) mod m, …, h_k(x) mod m; contains(x) returns true iff all those bits are set.

add("foo"):
   h1=37 h2=812 h3=4    →  bits[37]=bits[812]=bits[4]=1

contains("bar"):
   h1=99 h2=812 h3=120  →  bits[99]=0  ⇒  definitely absent
contains("foo"):
   h1=37 h2=812 h3=4    →  all 1  ⇒  probably present

False positives are inherent (any other key that hits the same k bits looks present); false negatives are impossible (a stored key set its bits, and we never unset).

2. Why It Matters

Without a bloom filter	With one
LevelDB / RocksDB `Get(k)` on a miss probes every SSTable's index → many disk reads	One in-memory bit-test per SSTable rejects 99% of misses
Distributed cache: "do I have this key?" requires a network RTT	Local bit-test on a 1 MB filter answers in nanoseconds
Spell-checker holds full dictionary	Few bits per word
Webcrawler revisits the same URL	A few bits per URL prevent recrawl

Filter sizes are tiny: at the textbook optimum (~9.6 bits/key for 1% FPR) a million keys fit in 1.2 MB. Cache-resident.

3. How It Works

For n inserts into m bits with k hashes (assuming independent uniform hashes), the probability a given bit is still zero is (1 - 1/m)^(kn) ≈ e^(-kn/m), so the false-positive rate is

$$\text{FPR} \approx \left(1 - e^{-kn/m}\right)^k$$

Differentiating with respect to k yields the optimal hash count

$$k^* = \frac{m}{n}\ln 2 \approx 0.693 \cdot \frac{m}{n}$$

and the achievable FPR at $k^*$:

$$\text{FPR}^* \approx 0.6185^{\,m/n}$$

So 10 bits/key ⇒ ~1% FPR with 7 hashes; 20 bits/key ⇒ ~0.01% with 14 hashes.
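
The formulas above turn directly into a sizing routine. The Go sketch below is one way to write it; sizeFor and predictedFPR are hypothetical names, and clamping k to at least 1 is an assumption of this example (the cap at 30 follows the design notes later in this lab).

```go
package main

import (
	"fmt"
	"math"
)

// sizeFor picks m = ceil(-n ln(fpr) / (ln 2)^2) and k = round((m/n) ln 2).
func sizeFor(n uint64, fpr float64) (m uint64, k uint32) {
	m = uint64(math.Ceil(-float64(n) * math.Log(fpr) / (math.Ln2 * math.Ln2)))
	kf := math.Round(float64(m) / float64(n) * math.Ln2)
	if kf < 1 {
		kf = 1 // assumption: never fewer than one hash
	}
	if kf > 30 {
		kf = 30 // cap from the design notes
	}
	return m, uint32(kf)
}

// predictedFPR evaluates (1 - e^(-kn/m))^k.
func predictedFPR(n, m uint64, k uint32) float64 {
	fill := 1 - math.Exp(-float64(k)*float64(n)/float64(m))
	return math.Pow(fill, float64(k))
}

func main() {
	for _, fpr := range []float64{0.10, 0.05, 0.01, 0.001} {
		m, k := sizeFor(10000, fpr)
		fmt.Printf("target=%.3f  m=%d bits  k=%d  bits/key=%.2f  predicted=%.4f\n",
			fpr, m, k, float64(m)/10000, predictedFPR(10000, m, k))
	}
}
```

Because k is rounded to an integer, the predicted rate lands close to the target rather than exactly on it.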

Kirsch–Mitzenmacher double hashing

We do not compute k independent hashes. Per Kirsch & Mitzenmacher (2006), it is sufficient — with no measurable FPR penalty — to compute one 64-bit hash, split it into halves h1 and h2, and synthesize:

g_i(x) = h1(x) + i * h2(x)   for i = 0..k-1

This is what LevelDB, RocksDB, and most production filters do.

In this lab the underlying 64-bit hash is FNV-1a64 of the key, then mixed once through SplitMix64 to spread the bits. (FNV-1a64 alone has weak avalanche, so its two 32-bit halves are not equally well mixed, and the Kirsch–Mitzenmacher splitting needs both halves to be well-distributed.)
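
Put together, the hash chain and the index synthesis might look like the Go sketch below. Two things here are assumptions rather than the lab's definition: which half of the mixed hash becomes h1 versus h2, and the use of a plain % for the final reduction (the lab's design notes describe a multiply-based reduction instead). The function name indices is hypothetical.

```go
package main

import "fmt"

// fnv1a64 is the standard 64-bit FNV-1a hash.
func fnv1a64(b []byte) uint64 {
	h := uint64(0xcbf29ce484222325) // offset basis
	for _, c := range b {
		h ^= uint64(c)
		h *= 0x100000001b3 // FNV prime
	}
	return h
}

// splitmix64 is one step of the SplitMix64 generator used as a finalizer.
func splitmix64(x uint64) uint64 {
	x += 0x9e3779b97f4a7c15
	x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9
	x = (x ^ (x >> 27)) * 0x94d049bb133111eb
	return x ^ (x >> 31)
}

func mix64(b []byte) uint64 { return splitmix64(fnv1a64(b)) }

// indices derives k bit positions in [0, m) via g_i = h1 + i*h2.
func indices(key []byte, k uint32, m uint64) []uint64 {
	h := mix64(key)
	h1, h2 := h>>32, h&0xffffffff // assumption: high half is h1, low half is h2
	out := make([]uint64, k)
	for i := uint64(0); i < uint64(k); i++ {
		out[i] = (h1 + i*h2) % m // plain modulo, used here for clarity
	}
	return out
}

func main() {
	key := []byte("foobar")
	fmt.Printf("fnv1a64=%016x mix64=%016x\n", fnv1a64(key), mix64(key))
	fmt.Println(indices(key, 7, 9586))
}
```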

On-disk / on-wire layout

 ┌─────────┬─────────┬───────────────────────────┐
 │ k (u32) │ m (u64) │  bits  (⌈m/8⌉ bytes, LE)  │
 └─────────┴─────────┴───────────────────────────┘

Identical layout across Rust/Go/C++ so all three can read each other's filters byte-for-byte.

4. Core Terminology

Term	Definition
`m`	Bit-array size in bits.
`n`	Number of distinct keys inserted.
`k`	Number of hash functions per key.
FPR	False-positive rate at query time. False negatives are impossible.
Bits/key	`m / n`. The single knob that determines achievable FPR.
Saturation	Once a large fraction of bits are 1, FPR climbs sharply; filters should be sized for the maximum expected `n`.
Counting Bloom	Variant that supports `remove` by storing 4-bit counters per slot. Costs 4× memory.
Cuckoo filter	Modern alternative: supports delete, often lower space at FPR ≤ 1%, harder to size.
Xor filter	Static (build once, query many) — best space efficiency, but no incremental inserts.

5. Mental Models

Bloom is a hash-collision amplifier. One hash collision is rare; needing k of them simultaneously is rarer. The filter trades memory for that compounding.
A Bloom filter is a negative index. Use it to avoid work; never use it to prove presence.
Hash quality matters less than independence. Once individual bits are well-distributed, the Kirsch–Mitzenmacher trick gives you arbitrarily many "independent" hashes for free.
You can compose them. Union ⇒ bitwise OR (with same m, k). Approximate intersection ⇒ bitwise AND (overestimates).

6. Common Misconceptions

"FPR depends only on the number of bits set." It also depends on k: for a fill fraction p, FPR ≈ p^k, and p itself is determined by m, n, and k. Two filters with the same fill factor but different k have different FPR.
"Bigger k is always better." Past $k^*$, FPR climbs again because each insert sets more bits, accelerating saturation.
"I can resize a Bloom filter." No — bit positions depend on m. Resize by building a fresh filter from the underlying data (or by maintaining a scalable filter, which is a series of growing Bloom filters).
"Cryptographic hashes are required." Wasted CPU. Anything well-distributed and fast (FNV, xxhash, MurmurHash3, CityHash) works.
"remove would be cheap if I just cleared the bits." It would also clear bits set by every other key that shares positions. Counting Bloom exists for this reason.

7. Interview Talking Points

Derive $k^* = (m/n) \ln 2$ and the resulting FPR formula from first principles.
Explain Kirsch–Mitzenmacher and why it doesn't increase FPR (citation: Less Hashing, Same Performance, ESA 2006).
Walk through how RocksDB pairs a Bloom filter with each SSTable — and how the new ribbon filter improves on that.
Quantify: "for 1% FPR you need ~10 bits/key; for one-in-a-million, ~30."
Contrast Bloom vs. Cuckoo vs. xor filters and their tradeoffs.

8. Connections to Other Labs

db-06 — every SSTable carries an embedded Bloom filter.
db-07 — compaction rebuilds filters because input filters can't be merged exactly.
db-08 — filter block is cached separately from data blocks.
db-09 — LookupKey flow consults the per-table filter before reading the index block.
db-21 — prefix Bloom filters, partitioned filters, ribbon filters.

References — Bloom Filters and Hashing

Foundational papers

Burton H. Bloom, Space/Time Trade-offs in Hash Coding with Allowable Errors, CACM 1970. The original 2-page paper. https://dl.acm.org/doi/10.1145/362686.362692
Adam Kirsch & Michael Mitzenmacher, Less Hashing, Same Performance: Building a Better Bloom Filter, ESA 2006. https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf
Bin Fan, David G. Andersen, Michael Kaminsky, Michael D. Mitzenmacher, Cuckoo Filter: Practically Better Than Bloom, CoNEXT 2014. https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf
Thomas M. Graf & Daniel Lemire, Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters, JEA 2020. https://arxiv.org/abs/1912.08258
Peter C. Dillinger & Stefan Walzer, Ribbon Filter: Practically Smaller Than Bloom and Xor, 2021. https://arxiv.org/abs/2103.02515

Production code to read

LevelDB filter policy: https://github.com/google/leveldb/blob/main/util/bloom.cc
RocksDB filter blocks: https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter
RocksDB ribbon filter implementation: https://github.com/facebook/rocksdb/blob/main/util/ribbon_impl.h

Survey & blog posts

Daniel Lemire, "All about Bloom filters" series: https://lemire.me/blog/tag/bloom-filter/
Jeff Dean's classic numbers-every-programmer-should-know — useful when sizing filters against disk-seek and RAM costs.
Hadron, "How RocksDB sizes filters": https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide

Hash functions

Fowler–Noll–Vo (FNV) reference: http://www.isthe.com/chongo/tech/comp/fnv/
SplitMix64 (the mixer from Steele, Lea & Flood's SplitMix generator; C reference by Sebastiano Vigna): https://prng.di.unimi.it/splitmix64.c
Austin Appleby, MurmurHash3: https://github.com/aappleby/smhasher/wiki/MurmurHash3
xxHash by Yann Collet: https://github.com/Cyan4973/xxHash

db-02 — data structures for storage reused the FNV-1a64 hash now wrapped here.
db-06 — SSTable format embeds the filter generated here.

Analysis — Bloom Filters and Hashing

Problem statement

Build a fixed-memory probabilistic set that:

Never reports false negatives (lookups for present keys always return true).
Reports false positives at a tunable, well-understood rate.
Is fast enough to consult in the hot path of a key-value store lookup (≈ nanoseconds).
Has an on-disk representation identical across Rust, Go, and C++ so the same filter built in any language can be read by any other.

Constraints

Constraint	Why it matters
Compact (≤ 2 bytes/key for 5% FPR)	The filter is loaded into RAM beside the table it indexes.
Constant-time `add` and `contains`	Hot path of `Get(k)`.
Deterministic across languages	Cross-language tests must pass.
Single 64-bit hash	We synthesize `k` indices via Kirsch–Mitzenmacher — keeps CPU low.
No remove	Pure Bloom. Counting variants left to db-21.

Design decisions

Base hash = FNV-1a64 then SplitMix64 mixing. FNV-1a64 is trivial to implement identically across languages; SplitMix64 finalizing fixes its weak avalanche so the upper and lower 32 bits are both well-distributed. The two 32-bit halves become h1 and h2 for double hashing.
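Kirsch–Mitzenmacher: g_i = h1 + i*h2, all u64 arithmetic, with the final reduction into [0, m) done by a single u128-multiplication trick (Daniel Lemire, Fast Random Integer Generation in an Interval, 2018) so we don't pay for a div. That multiply-shift is a range reduction rather than a true modulus, so it does not give the same indices as %; every language must use the same reduction.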
Bit array is little-endian byte-packed, bit i lives in bytes[i/8] >> (i%8) & 1. Same convention LevelDB and RocksDB use.
Header = k(u32 LE) || m(u64 LE). 12 bytes total. We deliberately put k first so a partial-read of just the header reveals the hash count without needing m.
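No checksum on the filter itself. A bit flipped from 0 to 1 only adds a few false positives, but a bit flipped from 1 to 0 can turn stored keys into false negatives, so the filter must not be trusted without integrity protection from its container; page-level checksumming belongs to db-06 (SSTable).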
with_fpr(n, fpr) constructor. Picks m = ceil(-n * ln(fpr) / (ln 2)^2) and k = round((m/n) * ln 2). Caps k at 30 to avoid degenerate sizing for absurdly small FPRs.

Why this design over alternatives

vs MurmurHash3 / xxhash: faster and arguably higher quality, but each is hundreds of lines to re-implement identically in three languages. FNV+SplitMix is 12 lines per language and indistinguishable in our use case.
vs k independent hashes: 2× CPU for no measurable FPR change (Kirsch & Mitzenmacher 2006).
vs Cuckoo / xor filters: more space-efficient at low FPR but much more code. Worth a separate lab — db-21.
vs in-language hashers (Rust's DefaultHasher, Go's hash/maphash, C++ std::hash<std::string>): none has a portable output. Rust's DefaultHasher algorithm is unspecified and may change between releases; Go's maphash is seeded randomly per process; C++ std::hash is implementation-defined. None of them survive cross-language interop.

Failure modes addressed

Failure	How
FPR much higher than claimed	Test V4 (`fpr_within_2x`): empirically measure FPR with 100k random absent queries and assert it's within 2× of the theoretical bound.
Bit packing mismatched across languages	Cross-language test (`scripts/cross_test.sh`): each writer dumps its filter bytes; each reader queries it for known-present & known-absent keys.
Endian mismatch in header	All header fields encoded little-endian explicitly.
Hash function mismatch	Test V1 (`fnv1a64_known_vectors`): known FNV-1a64 vectors (`""→0xcbf29ce484222325`, `"foobar"→0x85944171f73967e8`) checked at startup.
Saturation at `n ≫ planned`	`contains` still works; FPR climbs gracefully. Filter constructors document the assumed `n`.

Failure modes NOT addressed

Concurrent inserts. Single writer model. Concurrent add calls race on the read-modify-write of shared bytes and can lose bits, which shows up as false negatives. Lock externally or use an atomic OR per byte; covered in db-08 / db-21.
Adversarial keys. FNV-1a64 is not cryptographic — an attacker can craft collisions to inflate FPR. Use SipHash / xxh3 (with secret seed) if filter inputs are attacker-controlled.
Deletion. Use a counting Bloom or a cuckoo filter. See db-21.

Execution — How to Build and Run

Quick start (per language)

# Rust
cd src/rust
cargo build --release
cargo test --release
./target/release/bloombench --help

# Go
cd src/go
go test ./...
go build -o bin/bloombench ./cmd/bloombench
./bin/bloombench --help

# C++
cd src/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
ctest --test-dir build
./build/bloombench --help

CLI: `bloombench`

A single binary per language. Subcommands have the same shape across all three so the cross-language tests can shell out to any of them interchangeably.

Subcommand	Args	Behaviour
`hash STR`	one string	Print `fnv1a64=… splitmix=… h1=… h2=…` for the given input
`build PATH N FPR`	output path, key count, target FPR	Insert keys `key0..key{N-1}` and write the filter to PATH
`query PATH KEY`	filter path, key	Print `present` or `absent` for one key
`query-file PATH KEYS_FILE`	filter path, file with one key per line	Print results for each
`fpr-test PATH N M`	filter path (built with N keys), M random absent keys	Measure FPR and print observed vs theoretical

Library API

fnv1a64(bytes) -> u64
splitmix64(u64) -> u64
mix64(bytes) -> u64                   // = splitmix64(fnv1a64(bytes))

BloomFilter::new(m_bits, k) -> Bloom
BloomFilter::with_fpr(n, fpr) -> Bloom
  - picks m, k optimally; caps k at 30
Bloom::add(bytes)
Bloom::contains(bytes) -> bool
Bloom::k(), Bloom::m_bits(), Bloom::m_bytes()
Bloom::encode() -> Vec<u8>            // header || bits
Bloom::decode(bytes) -> Bloom         // inverse, validates header

On-disk / on-wire layout

 ┌─────────┬─────────┬─────────────────────────┐
 │ k (u32) │ m (u64) │  bits  ⌈m/8⌉ bytes      │
 │         │         │  bit i = bytes[i/8] >> (i mod 8) & 1
 └─────────┴─────────┴─────────────────────────┘
       4         8                  ⌈m/8⌉

All integers little-endian. No padding. No internal checksum.
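
A compact Go sketch of this layout follows. The type and function names (Bloom, newBloom, encode, decode, byteLen) are illustrative, and the validation rules in decode (non-zero k and m, exact body length) are assumptions about what "validates header" could mean.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

type Bloom struct {
	k    uint32
	m    uint64
	bits []byte
}

// byteLen is ceil(m/8) without overflowing for large m.
func byteLen(m uint64) uint64 {
	n := m / 8
	if m%8 != 0 {
		n++
	}
	return n
}

func newBloom(m uint64, k uint32) *Bloom {
	return &Bloom{k: k, m: m, bits: make([]byte, byteLen(m))}
}

// Bit i lives in bits[i/8], at position i%8 counting from the low bit.
func (b *Bloom) setBit(i uint64)      { b.bits[i/8] |= 1 << (i % 8) }
func (b *Bloom) getBit(i uint64) bool { return b.bits[i/8]>>(i%8)&1 == 1 }

// encode writes k (u32 LE) || m (u64 LE) || bits.
func (b *Bloom) encode() []byte {
	out := make([]byte, 12+len(b.bits))
	binary.LittleEndian.PutUint32(out[0:4], b.k)
	binary.LittleEndian.PutUint64(out[4:12], b.m)
	copy(out[12:], b.bits)
	return out
}

// decode is the inverse of encode and rejects inconsistent input.
func decode(buf []byte) (*Bloom, error) {
	if len(buf) < 12 {
		return nil, errors.New("short header")
	}
	k := binary.LittleEndian.Uint32(buf[0:4])
	m := binary.LittleEndian.Uint64(buf[4:12])
	if k == 0 || m == 0 {
		return nil, errors.New("k and m must be non-zero")
	}
	if uint64(len(buf)-12) != byteLen(m) {
		return nil, errors.New("body length does not match m")
	}
	return &Bloom{k: k, m: m, bits: append([]byte(nil), buf[12:]...)}, nil
}

func main() {
	b := newBloom(31, 5)
	b.setBit(0)
	b.setBit(9)
	b.setBit(30)
	fmt.Printf("% x\n", b.encode())
}
```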

Verifying

./scripts/verify.sh        # per-language unit + property tests
./scripts/cross_test.sh    # writer/reader cross-product over {go,rust,cpp}

Observation — Looking at the Bits

1. Hexdump a freshly built filter

./build/bloombench build /tmp/bf 4 0.05
xxd /tmp/bf

00000000: 0500 0000 1f00 0000 0000 0000 1206 92    ...............

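Reading the header: the first 4 bytes are k = 5 and the next 8 are m = 0x1f = 31 bits, so ⌈31/8⌉ = 4 bytes of bit vector follow. The trailing 12 06 92 … is that bit vector with 4 keys mixed in. Your exact k and m depend on how with_fpr rounds: the formula in the design notes gives m = ⌈24.94⌉ = 25 and k = 4 for n = 4, fpr = 0.05, so treat this dump as an illustration of the layout rather than a golden value.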

For 1000 keys at 1% FPR you should see roughly 9.6 bits/key ⇒ 1200 bytes of bits, and k ≈ 7.

2. Sanity-check the hash chain

./build/bloombench hash foobar
# fnv1a64=85944171f73967e8  splitmix=...  h1=...  h2=...

Known FNV-1a64 vectors (used in tests):

Input	`fnv1a64`
`""`	`0xcbf29ce484222325`
`"a"`	`0xaf63dc4c8601ec8c`
`"foobar"`	`0x85944171f73967e8`

All three languages must print the same fnv1a64 and same splitmix64 for any given input. If they don't, cross-language interop is dead on arrival.

3. Empirical FPR matches the formula

./build/bloombench build /tmp/bf 100000 0.01
./build/bloombench fpr-test /tmp/bf 100000 1000000
# expected: observed=0.0098  theoretical≈0.0100   (within ±20% with 1M samples)

If observed FPR is much higher than theoretical:

k is wrong (probably 0 or 1 due to integer truncation; check with_fpr math).
Hash is biased (FNV without SplitMix mixing — the high bits are clumped).
mod m step has a bias (using h % m with non-prime m is OK; using h & (m-1) only works when m is a power of two).

If observed FPR is much lower: probably double-counting or your "random absent" key generator overlaps with the present set — verify input.
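
For readers who want to see the measurement end to end, here is a self-contained Go sketch of the experiment. It is not the bloombench implementation: the names (withFPR, measureFPR, probe) are hypothetical, indices are reduced with a plain %, the choice of which hash half is h1 is an assumption, and the absent keys are simply strings with a different prefix from the inserted ones.

```go
package main

import (
	"fmt"
	"math"
)

func fnv1a64(b []byte) uint64 {
	h := uint64(0xcbf29ce484222325)
	for _, c := range b {
		h ^= uint64(c)
		h *= 0x100000001b3
	}
	return h
}

func splitmix64(x uint64) uint64 {
	x += 0x9e3779b97f4a7c15
	x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9
	x = (x ^ (x >> 27)) * 0x94d049bb133111eb
	return x ^ (x >> 31)
}

type Bloom struct {
	m    uint64
	k    uint32
	bits []byte
}

// withFPR sizes the filter using the textbook formulas (k clamped to 1..30).
func withFPR(n uint64, fpr float64) *Bloom {
	m := uint64(math.Ceil(-float64(n) * math.Log(fpr) / (math.Ln2 * math.Ln2)))
	k := uint32(math.Round(float64(m) / float64(n) * math.Ln2))
	if k < 1 {
		k = 1
	}
	if k > 30 {
		k = 30
	}
	return &Bloom{m: m, k: k, bits: make([]byte, (m+7)/8)}
}

// probe visits the k bit positions of key; it stops early if visit returns false.
func (b *Bloom) probe(key []byte, visit func(i uint64) bool) bool {
	h := splitmix64(fnv1a64(key))
	h1, h2 := h>>32, h&0xffffffff
	for i := uint64(0); i < uint64(b.k); i++ {
		if !visit((h1 + i*h2) % b.m) {
			return false
		}
	}
	return true
}

func (b *Bloom) add(key []byte) {
	b.probe(key, func(i uint64) bool { b.bits[i/8] |= 1 << (i % 8); return true })
}

func (b *Bloom) contains(key []byte) bool {
	return b.probe(key, func(i uint64) bool { return b.bits[i/8]>>(i%8)&1 == 1 })
}

// measureFPR inserts key0..key{n-1}, then queries `probes` keys that were never added.
func measureFPR(n int, target float64, probes int) (observed, theoretical float64, falseNegatives int) {
	b := withFPR(uint64(n), target)
	for i := 0; i < n; i++ {
		b.add([]byte(fmt.Sprintf("key%d", i)))
	}
	for i := 0; i < n; i++ {
		if !b.contains([]byte(fmt.Sprintf("key%d", i))) {
			falseNegatives++
		}
	}
	hits := 0
	for i := 0; i < probes; i++ {
		if b.contains([]byte(fmt.Sprintf("absent%d", i))) {
			hits++
		}
	}
	observed = float64(hits) / float64(probes)
	theoretical = math.Pow(1-math.Exp(-float64(b.k)*float64(n)/float64(b.m)), float64(b.k))
	return observed, theoretical, falseNegatives
}

func main() {
	obs, theo, fn := measureFPR(100000, 0.01, 1000000)
	fmt.Printf("observed=%.4f theoretical=%.4f false_negatives=%d\n", obs, theo, fn)
}
```

Keeping the present and absent key sets disjoint by construction (different prefixes) avoids the overlap problem described above.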

4. Bit density vs FPR

for fpr in 0.10 0.05 0.01 0.001 0.0001; do
  ./build/bloombench build /tmp/bf 10000 $fpr
  ls -l /tmp/bf
done

Sample file sizes (header + body):

FPR	Bytes	Bits/key
0.10	~6 040	~4.8
0.05	~7 820	~6.2
0.01	~12 010	~9.6
0.001	~17 990	~14.4
0.0001	~23 970	~19.2

The 9.6 bits/key heuristic for 1% FPR is the one most often quoted in interviews.

5. Cross-language byte-identical filters

for lang in go rust cpp; do
  ./src/$lang/.../bloombench build /tmp/bf.$lang 1000 0.01
done
md5sum /tmp/bf.*    # all three identical

If any digest differs, suspect (in order): endian, bit ordering inside the byte, integer types of the header, or hash mismatch.

What "working" looks like

Bytes 0..3 = k, bytes 4..11 = m, bytes 12..end = bits. No padding.
Empirical FPR is within ±2× of theoretical for any (n, fpr) you try.
All three languages produce identical filters and read each other's filters.

What "broken" looks like

contains(k) returns false for a key you just inserted ⇒ false negative ⇒ critical bug. Likely indexing math: set and get disagree about bit-within-byte.
FPR is 100% ⇒ all bits are 1 ⇒ either m was rounded down to 0 or you're indexing past the bit array.
FPR is 0% with realistic load ⇒ add is a no-op or contains always returns false. Test V3 (no_false_negatives) fails in both cases.
Cross-language readers disagree ⇒ print the first 16 bytes of each filter and the first three h1, h2 values for a known key; one of them is wrong.

Verification — What to Test and How

Per-language property tests

#	Test	Pass if
V1	`fnv1a64_known_vectors`	`""` → `0xcbf29ce484222325`; `"a"` → `0xaf63dc4c8601ec8c`; `"foobar"` → `0x85944171f73967e8`
V2	`splitmix64_known_vectors`	`splitmix64(0)` = `0xe220a8397b1dcdaf`; `splitmix64(0xdeadbeef)` = `0x4adfb90f68c9eb9b`
V3	`no_false_negatives`	Insert N=10 000 random keys (seeded); `contains` returns true for every one
V4	`fpr_within_2x`	Build for n=10 000 at fpr=0.01; query 100 000 random absent keys; observed FPR ≤ 2× theoretical
V5	`optimal_k_formula`	`with_fpr(1000, 0.01)` returns `k=7` and `9 580 ≤ m ≤ 9 620` (allow ±0.5%)
V6	`encode_decode_roundtrip`	encode → decode → query the same keys: identical results
V7	`header_layout`	First 4 bytes = `k` LE; next 8 = `m` LE; payload length = ⌈m/8⌉
V8	`empty_filter_rejects_all`	New filter with m=64, k=3; `contains` returns false for 1000 random keys

Cross-language test

scripts/cross_test.sh performs the writer × reader matrix for {go, rust, cpp}²:

Each writer builds a filter for the same fixed-seed key set (1 000 keys).
Filters must be byte-identical (md5sum over filter file).
Each reader opens each writer's filter and runs:
- 1 000 known-present queries → must all return present
- 1 000 known-absent queries (different seed) → results must match across readers

This catches:

Endian or bit-order bugs in the header / bit array.
Hash mismatch (fnv1a64 or splitmix64 differs).
mod m reduction differs (e.g. one implementation uses Lemire's u128 multiply-shift and another uses %; the two map the same hash to different indices).

What "passing" means

All 8 property tests green in all three languages.
cross_test.sh exits 0 with 3 byte-identical filter files (one per writer) and 9 passing reader runs.
Manual smoke: hexdump of a 4-key filter matches the structure described in docs/observation.md.

Broader Ideas — Beyond the Minimum

Block / partitioned filters

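A blocked Bloom filter confines all k probes for a key to a single cache line, so one lookup costs at most one cache miss; RocksDB's current Bloom implementation is cache-local in this sense. (RocksDB's "partitioned filters" are a separate feature that splits a large filter block into independently cacheable partitions behind a top-level index.) Trade: marginally higher FPR for ~3× faster contains on cache-cold filters. See Optimizing Bloom Filter: Challenges, Solutions, and Comparisons (Luo et al., IEEE 2019).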

Cuckoo filters

Replace the bit array with a hash table of short fingerprints. Uses fewer bits/key than Bloom once the target FPR drops below roughly 3% (at 1% FPR expect about 9 to 10 bits/key, close to Bloom's 9.6), and supports remove. Slower to build, and inserts can fail when the table is over-full. Excellent for membership tests with a known max size and a need for deletion.

Xor filters

Build-once, query-many. ~9% smaller than Cuckoo at the same FPR, faster lookup (always exactly 3 memory accesses). Bad fit if you insert incrementally; great for static datasets like compiled SSTable filters.

Ribbon filters (RocksDB 6.15+)

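A linear-algebra reformulation of xor filters. ~30% smaller than Bloom at the same FPR, with comparable lookup cost but several times more CPU to build (the RocksDB team reports roughly 3 to 4×). RocksDB ships them as an opt-in filter policy (NewRibbonFilterPolicy); Bloom remains the default for SSTable filters.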

Prefix Bloom filters

When most queries are by prefix (e.g. userid:12345:*), build the filter from prefixes instead of full keys. Saves space and lets prefix-range queries use the filter.

Scaling without resizing

A scalable Bloom filter (Almeida et al., 2007) chains a sequence of progressively larger filters with progressively tighter FPRs. add writes to the youngest; contains ORs across all. The number of chained filters grows logarithmically with n, while total memory still grows roughly linearly.

Compressed Bloom filters

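If you transmit a Bloom filter over a network, compression can help, but only for a sparse filter: one tuned for minimum in-memory size has about half its bits set and barely compresses. Mitzenmacher (2002) showed that optimizing for compressed size leads to a larger, sparser array and a smaller k than optimizing for in-memory FPR.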

Cardinality estimation: HyperLogLog vs Bloom

Bloom tells you "in set" with FPR; HLL gives you |set| ± ~2% with constant memory. Often used in the same systems for different questions.

Filters in distributed systems

Bigtable / HBase: block-level filters per SSTable.
Cassandra: row-level filter per SSTable, plus a key cache.
Akamai / CDNs: "did this URL get cached?" Bloom-based pre-checks.
Gmail: per-user spam fingerprint filters.
Bitcoin SPV clients (BIP 37): filters published to full nodes to indicate which addresses the SPV wallet cares about. Famously broken from a privacy standpoint — the filter leaks the address set.

Adversarial considerations

Bloom-filter parameters and hashes are usually public. If users can choose keys, they can craft collisions that fill the filter and push FPR to 100%. Defenses:

Use a keyed hash (SipHash) seeded at filter creation.
Cap inserts or fall back to an exact structure beyond a threshold.
Periodically rebuild.

Information-theoretic lower bound

Carter et al. (1978) prove that any approximate set with FPR ε requires at least n * log2(1/ε) bits; that's ~6.64 bits/key for 1% FPR. Bloom uses ~9.6 (44% overhead). Standard xor filters sit at ~23% overhead (1.23 × log2(1/ε) bits/key), and their binary fuse successors narrow that to roughly 8 to 13%. Ribbon filters get within ~3%.
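The arithmetic behind the lower bound and the Bloom overhead is easy to reproduce. A minimal Rust sketch follows; the function names are illustrative and not part of the lab's API.

```rust
/// Information-theoretic minimum bits per key for false-positive rate `eps`.
pub fn lower_bound_bits_per_key(eps: f64) -> f64 {
    (1.0 / eps).log2()
}

/// Bits per key of an optimally sized classic Bloom filter:
/// m/n = -ln(eps) / (ln 2)^2 = log2(1/eps) / ln 2.
pub fn bloom_bits_per_key(eps: f64) -> f64 {
    lower_bound_bits_per_key(eps) / std::f64::consts::LN_2
}

/// Fractional space overhead of a structure spending `bits` bits per key.
pub fn overhead(bits: f64, eps: f64) -> f64 {
    bits / lower_bound_bits_per_key(eps) - 1.0
}
```

For eps = 0.01 these give about 6.64 and 9.59 bits/key. The Bloom overhead comes to about 0.44 for any eps, because the ratio between the two is always 1/ln 2.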

Step 01 — Hash chain (FNV-1a64 → SplitMix64 → double hashing)

Before any filter logic, get the hash chain identical across all three languages. If fnv1a64("foobar") doesn't return 0x85944171f73967e8 everywhere, nothing else will work.

1. FNV-1a64

Algorithm:

hash = 0xcbf29ce484222325
for byte in input:
    hash ^= byte
    hash = hash * 0x100000001b3        // wrapping 64-bit multiplication
return hash

Known test vectors:

Input	Result
`""` (empty)	`0xcbf29ce484222325` (the initial value)
`"a"`	`0xaf63dc4c8601ec8c`
`"foobar"`	`0x85944171f73967e8`

Side-by-side:

pub fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

func FNV1a64(b []byte) uint64 {
    var h uint64 = 0xcbf29ce484222325
    for _, x := range b {
        h ^= uint64(x)
        h *= 0x100000001b3
    }
    return h
}

std::uint64_t Fnv1a64(const std::uint8_t* p, std::size_t n) {
    std::uint64_t h = 0xcbf29ce484222325ULL;
    for (std::size_t i = 0; i < n; ++i) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

⚠️ Two easy traps: don't use FNV-1 (without the "a": it multiplies before XORing); don't use the 32-bit prime or offset basis (different constants).
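To see the first trap concretely, here is a small Rust sketch that puts both variants side by side. `fnv1_64` is a name introduced here purely for contrast; the lab itself only uses FNV-1a.

```rust
const FNV_OFFSET_BASIS_64: u64 = 0xcbf29ce484222325;
const FNV_PRIME_64: u64 = 0x100000001b3;

/// FNV-1a: XOR the byte in first, then multiply. This is the variant the lab uses.
pub fn fnv1a64(bytes: &[u8]) -> u64 {
    bytes.iter().fold(FNV_OFFSET_BASIS_64, |h, &b| {
        (h ^ u64::from(b)).wrapping_mul(FNV_PRIME_64)
    })
}

/// FNV-1: multiply first, then XOR. Same constants, different results.
pub fn fnv1_64(bytes: &[u8]) -> u64 {
    bytes.iter().fold(FNV_OFFSET_BASIS_64, |h, &b| {
        h.wrapping_mul(FNV_PRIME_64) ^ u64::from(b)
    })
}
```

Both functions agree on the empty input (they return the offset basis) and disagree from the first byte on, which is why a test vector for `""` alone cannot catch this mix-up.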

2. SplitMix64 finalizer

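FNV-1a64 on its own diffuses input bits unevenly: its lowest output bit, for example, is just the XOR of the offset basis's lowest bit with the lowest bit of every input byte. SplitMix64 (Steele, Lea & Flood; widely known through Vigna's splitmix64.c) is a single-step bit mixer with near-perfect avalanche on a 64-bit input. We apply it to FNV's output so that both the upper and lower 32-bit halves are usable as independent hashes.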

splitmix64(x):
    x = x + 0x9e3779b97f4a7c15
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9
    x = (x ^ (x >> 27)) * 0x94d049bb133111eb
    return  x ^ (x >> 31)

Known vectors:

Input	Output
`0`	`0xe220a8397b1dcdaf`
`0xdeadbeef`	`0x4adfb90f68c9eb9b`
`splitmix64(fnv1a64("foobar"))`	use the test to lock this in

The combined hash we actually use:

mix64(bytes) := splitmix64(fnv1a64(bytes))
h1 = mix64(bytes) & 0xffffffff               // low 32 bits
h2 = mix64(bytes) >> 32                       // high 32 bits
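The pseudocode above might translate to Rust as follows. This is a sketch: `splitmix64` and `mix64` follow the definitions in this step, while `halves` is a helper name introduced here for illustration.

```rust
pub fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// One SplitMix64 step: add the golden-ratio increment, then two
/// xor-shift-multiply rounds and a final xor-shift. All arithmetic wraps.
pub fn splitmix64(x: u64) -> u64 {
    let mut z = x.wrapping_add(0x9e3779b97f4a7c15);
    z = (z ^ (z >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94d049bb133111eb);
    z ^ (z >> 31)
}

/// The combined hash: FNV-1a64 followed by the SplitMix64 finalizer.
pub fn mix64(bytes: &[u8]) -> u64 {
    splitmix64(fnv1a64(bytes))
}

/// Split a 64-bit hash into (h1, h2) = (low 32 bits, high 32 bits),
/// each widened back to u64 for the index arithmetic.
pub fn halves(h: u64) -> (u64, u64) {
    (h & 0xffff_ffff, h >> 32)
}
```

Computing `mix64` once and passing the result to `halves` avoids hashing the key twice, which the pseudocode's two `mix64(bytes)` calls would otherwise suggest.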

3. Synthesizing `k` indices (Kirsch–Mitzenmacher)

For a bit array of size m and k hashes:

for i in 0..k:
    g = h1 + i * h2          // wrapping 64-bit add
    idx = g mod m            // reduce to [0, m)
    use bits[idx]

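The reduction g mod m is hot. Naive % works, but a 64-bit modulo costs roughly 20 cycles. RocksDB and others use Lemire's multiply-shift range reduction (fastrange) instead:

fastrange(g, m) = ((g as u128) * (m as u128)) >> 64

(Equivalent to floor(g * m / 2^64), a near-uniform map of [0, 2^64) to [0, m).) It is not a modulo: for the same g it generally returns a different index than g % m, and it only spreads well when g covers the full 64-bit range. Here g = h1 + i * h2 is built from 32-bit halves and stays below 2^37 for k ≤ 30, so fastrange would send nearly every probe to index 0 unless g is remixed first. Whichever reduction you choose, use it in all three languages, otherwise the bit positions diverge and cross-language tests break.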

This lab uses plain % because it's identical across languages with no language-specific u128 syntax to worry about. Performance difference is irrelevant at filter-construction scale.
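A short Rust sketch of the index synthesis with plain `%`, next to the multiply-shift reduction for comparison. Both function names are illustrative; only the `%` version matches what this lab uses.

```rust
/// The k probe positions for one key, reduced with plain `%` as in this lab.
/// `h1` and `h2` are the low and high 32-bit halves of the mixed hash.
pub fn probe_indices(h1: u64, h2: u64, k: u32, m: u64) -> Vec<u64> {
    (0..u64::from(k))
        .map(|i| h1.wrapping_add(i.wrapping_mul(h2)) % m)
        .collect()
}

/// Lemire's multiply-shift range reduction: floor(g * m / 2^64), always in [0, m).
/// Shown for comparison only; it is NOT equal to `g % m`.
pub fn fastrange(g: u64, m: u64) -> u64 {
    ((u128::from(g) * u128::from(m)) >> 64) as u64
}
```

For example, `probe_indices(5, 3, 4, 7)` yields `[5, 1, 4, 0]`. `fastrange(10, 7)` is 0 while `10 % 7` is 3, and any g far below 2^64 collapses to 0 under `fastrange`, so it needs a full-width g to be useful.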

Test gate

Before moving on, all three bloombench hash foobar invocations must print:

fnv1a64=85944171f73967e8
splitmix=...        (same value across languages)
h1=...  h2=...      (same values across languages)

If those don't match, the rest of the lab cannot succeed.

Step 02 — Bit array, add, contains

The bit array

Backed by a Vec<u8> / []byte / std::vector<uint8_t> of length ⌈m / 8⌉. Indexing is little-endian within each byte:

bit i  →  byte index  = i / 8
         bit within  = i % 8         // bit 0 is the LSB
         test:  (bytes[i/8] >> (i%8)) & 1
         set:    bytes[i/8] |= 1 << (i%8)

Side-by-side:

fn set_bit(bits: &mut [u8], i: u64) {
    let idx = (i / 8) as usize;
    let off = (i % 8) as u8;
    bits[idx] |= 1u8 << off;
}
fn get_bit(bits: &[u8], i: u64) -> bool {
    let idx = (i / 8) as usize;
    let off = (i % 8) as u8;
    (bits[idx] >> off) & 1 == 1
}

func setBit(bits []byte, i uint64) {
    bits[i/8] |= 1 << (i % 8)
}
func getBit(bits []byte, i uint64) bool {
    return bits[i/8]&(1<<(i%8)) != 0
}

inline void SetBit(std::uint8_t* b, std::uint64_t i) {
    b[i / 8] |= std::uint8_t{1} << (i % 8);
}
inline bool GetBit(const std::uint8_t* b, std::uint64_t i) {
    return (b[i / 8] >> (i % 8)) & 1u;
}

⚠️ Pick one bit-order now. LSB-first (above) matches LevelDB and is the natural choice in C. MSB-first matches some networking specs (TCP option encoding). Whichever you pick, all three implementations must agree.

`add(key)`

add(key):
    h = mix64(key)
    h1 = h & 0xffffffff
    h2 = h >> 32
    for i in 0..k:
        idx = (h1 + i * h2) mod m
        set_bit(idx)

Notes:

All arithmetic is wrapping 64-bit (u64/uint64/std::uint64_t).
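i * h2 cannot actually overflow here: h1 and h2 are 32-bit values and k is clamped to 30 (Step 03), so h1 + i * h2 stays below 2^37. Use wrapping_mul/wrapping_add in Rust anyway, so the arithmetic stays well defined (and debug builds never panic) if the halves are ever widened or the clamp on k is lifted.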
We compute h once per key, then derive k indices. That's the entire Kirsch–Mitzenmacher win.

`contains(key)`

contains(key):
    h = mix64(key)
    h1 = h & 0xffffffff
    h2 = h >> 32
    for i in 0..k:
        idx = (h1 + i * h2) mod m
        if not get_bit(idx):
            return false
    return true

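Early-exit on the first zero bit. At the optimal k roughly half the bits are set, so each probe of an absent key finds a zero with probability about 1/2; with FPR 1% and a random absent key, you typically exit after 1 or 2 probes.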

Three-language `add` skeleton

pub fn add(&mut self, key: &[u8]) {
    let h = mix64(key);
    let h1 = h as u32 as u64;
    let h2 = h >> 32;
    for i in 0..self.k as u64 {
        let idx = h1.wrapping_add(i.wrapping_mul(h2)) % self.m;
        set_bit(&mut self.bits, idx);
    }
}

func (b *Bloom) Add(key []byte) {
    h := Mix64(key)
    h1 := h & 0xffffffff
    h2 := h >> 32
    for i := uint64(0); i < uint64(b.k); i++ {
        idx := (h1 + i*h2) % b.m
        setBit(b.bits, idx)
    }
}

void Bloom::Add(const std::uint8_t* k, std::size_t n) {
    std::uint64_t h = Mix64(k, n);
    std::uint64_t h1 = h & 0xffffffffULL;
    std::uint64_t h2 = h >> 32;
    for (std::uint64_t i = 0; i < k_; ++i) {
        std::uint64_t idx = (h1 + i * h2) % m_;
        SetBit(bits_.data(), idx);
    }
}
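Putting the pieces together, here is a compact single-file Rust sketch of the whole filter, including the `contains` path that the skeletons above leave out. Field names follow the Rust skeleton; `probe_indices` and the constructor's panic-on-zero behaviour are assumptions made for this example.

```rust
fn fnv1a64(bytes: &[u8]) -> u64 {
    bytes.iter().fold(0xcbf29ce484222325u64, |h, &b| {
        (h ^ u64::from(b)).wrapping_mul(0x100000001b3)
    })
}

fn splitmix64(x: u64) -> u64 {
    let mut z = x.wrapping_add(0x9e3779b97f4a7c15);
    z = (z ^ (z >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94d049bb133111eb);
    z ^ (z >> 31)
}

/// Hash once, then derive the k bit positions (Kirsch-Mitzenmacher).
fn probe_indices(key: &[u8], k: u32, m: u64) -> impl Iterator<Item = u64> {
    let h = splitmix64(fnv1a64(key));
    let (h1, h2) = (h & 0xffff_ffff, h >> 32);
    (0..u64::from(k)).map(move |i| h1.wrapping_add(i.wrapping_mul(h2)) % m)
}

pub struct Bloom {
    k: u32,
    m: u64,
    bits: Vec<u8>,
}

impl Bloom {
    /// Create an empty filter of `m` bits and `k` probes (both must be non-zero).
    pub fn new(m: u64, k: u32) -> Self {
        assert!(m > 0 && k > 0, "m and k must be non-zero");
        let byte_len = (m / 8 + u64::from(m % 8 != 0)) as usize; // ceil(m / 8)
        Bloom { k, m, bits: vec![0u8; byte_len] }
    }

    pub fn add(&mut self, key: &[u8]) {
        for idx in probe_indices(key, self.k, self.m) {
            self.bits[(idx / 8) as usize] |= 1u8 << (idx % 8); // LSB-first
        }
    }

    /// `all` stops at the first zero bit, which is the early exit described above.
    pub fn contains(&self, key: &[u8]) -> bool {
        probe_indices(key, self.k, self.m)
            .all(|idx| (self.bits[(idx / 8) as usize] >> (idx % 8)) & 1 == 1)
    }
}
```

Keeping the index derivation in one helper shared by `add` and `contains` removes the most common source of false negatives: the two paths disagreeing about where a key's bits live.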

Test gate

Insert keys "k0".."k999" (UTF-8). contains("kN") must be true for every N. Any false negative is a critical bug.
Filter must be byte-identical across the three languages (md5sum).

What broken looks like

contains is sometimes false for present keys → set and get disagree on bit-within-byte. Suspect LSB vs MSB.
Cross-language filters differ → mod m reduction differs (one uses & instead of % and m isn't a power of two), or h1/h2 halves are swapped.
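contains is always true → the filter is saturated: m came out far too small for n (a truncated or overflowed size calculation), so every bit ends up set. m = 0 itself looks different: h % 0 panics in Rust and Go and is undefined behavior in C++.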

Step 03 — Sizing, encode/decode, FPR measurement

Picking `m` and `k` from `(n, fpr)`

Given target false-positive rate p and expected key count n:

$$m = \left\lceil \frac{-n \cdot \ln p}{(\ln 2)^2} \right\rceil$$

$$k = \max\left(1,\; \text{round}\left(\frac{m}{n} \cdot \ln 2\right)\right)$$

Reference numbers (compute once, hard-code in tests):

n	p	m (bits)	bits/key	k
1 000	0.10	~4 793	4.79	3
1 000	0.01	~9 586	9.59	7
1 000	0.001	~14 378	14.38	10
10 000	0.01	~95 851	9.59	7

Implementation:

with_fpr(n, p):
    ln2     = ln(2)
    m_real  = -(n as f64) * ln(p) / (ln2 * ln2)
    m       = ceil(m_real)
    k_real  = (m as f64 / n as f64) * ln2
    k       = round(k_real) clamped to [1, 30]
    return BloomFilter::new(m, k)

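The clamp on k prevents pathological cases. with_fpr(1, 1e-100) would otherwise request k ≈ 332, i.e. 332 probes on every add and contains. Once the clamp kicks in, the achieved FPR is higher than the requested p; that is the price of a bounded per-operation cost.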
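The sizing pseudocode can be sketched in Rust as a pure function that returns the pair (m, k). The name `optimal_params` is hypothetical; in the lab this logic lives inside `with_fpr`.

```rust
/// Compute (m, k) for `n` expected keys and target false-positive rate `p`.
/// Assumes n >= 1 and 0 < p < 1; callers should validate before calling.
pub fn optimal_params(n: u64, p: f64) -> (u64, u32) {
    let ln2 = std::f64::consts::LN_2;
    // m = ceil(-n * ln(p) / (ln 2)^2)
    let m_real = -(n as f64) * p.ln() / (ln2 * ln2);
    let m = m_real.ceil() as u64;
    // k = round((m / n) * ln 2), clamped to [1, 30]
    let k_real = (m as f64 / n as f64) * ln2;
    let k = (k_real.round() as u32).clamp(1, 30);
    (m, k)
}
```

`optimal_params(1000, 0.01)` returns `(9586, 7)`, matching the reference table, and the pathological `optimal_params(1, 1e-100)` comes back with k = 30 rather than 333.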

Encode

[ k: u32 LE ][ m: u64 LE ][ bits: ⌈m/8⌉ bytes ]

pub fn encode(&self) -> Vec<u8> {
    let mut out = Vec::with_capacity(12 + self.bits.len());
    out.extend_from_slice(&self.k.to_le_bytes());
    out.extend_from_slice(&self.m.to_le_bytes());
    out.extend_from_slice(&self.bits);
    out
}

func (b *Bloom) Encode() []byte {
    out := make([]byte, 12+len(b.bits))
    binary.LittleEndian.PutUint32(out[0:4], b.k)
    binary.LittleEndian.PutUint64(out[4:12], b.m)
    copy(out[12:], b.bits)
    return out
}

std::vector<std::uint8_t> Bloom::Encode() const {
    std::vector<std::uint8_t> out(12 + bits_.size());
    EncodeU32LE(out.data() + 0, k_);
    EncodeU64LE(out.data() + 4, m_);
    std::memcpy(out.data() + 12, bits_.data(), bits_.size());
    return out;
}

Decode

decode(bytes):
    if len(bytes) < 12: error
    k = read u32 LE @ 0
    m = read u64 LE @ 4
    body = bytes[12:]
    if len(body) != ⌈m/8⌉: error
    return BloomFilter { k, m, bits: body }

Validate sizes. If k == 0 or m == 0, reject — those are nonsense.
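A possible Rust rendering of the decode pseudocode, paired with a matching encoder so the round trip can be checked. This is a sketch under simplifying assumptions: a plain struct with public fields, free functions instead of methods, and `String` errors for brevity.

```rust
#[derive(Debug, PartialEq)]
pub struct Bloom {
    pub k: u32,
    pub m: u64,
    pub bits: Vec<u8>,
}

/// Layout: [ k: u32 LE ][ m: u64 LE ][ bits: ceil(m/8) bytes ].
pub fn encode(f: &Bloom) -> Vec<u8> {
    let mut out = Vec::with_capacity(12 + f.bits.len());
    out.extend_from_slice(&f.k.to_le_bytes());
    out.extend_from_slice(&f.m.to_le_bytes());
    out.extend_from_slice(&f.bits);
    out
}

pub fn decode(bytes: &[u8]) -> Result<Bloom, String> {
    if bytes.len() < 12 {
        return Err("header truncated".to_string());
    }
    let mut k_raw = [0u8; 4];
    k_raw.copy_from_slice(&bytes[0..4]);
    let mut m_raw = [0u8; 8];
    m_raw.copy_from_slice(&bytes[4..12]);
    let (k, m) = (u32::from_le_bytes(k_raw), u64::from_le_bytes(m_raw));
    if k == 0 || m == 0 {
        return Err("k and m must be non-zero".to_string());
    }
    // ceil(m / 8) without the overflow that (m + 7) / 8 risks near u64::MAX.
    let expected = m / 8 + u64::from(m % 8 != 0);
    let body = &bytes[12..];
    if body.len() as u64 != expected {
        return Err(format!("body is {} bytes, expected {}", body.len(), expected));
    }
    Ok(Bloom { k, m, bits: body.to_vec() })
}
```

Checking the body length against ⌈m/8⌉ before building the filter means a corrupted `m` can never lead to an out-of-range bit index later.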

Measuring FPR

fpr-test(filter, n, m_queries):
    seed reader rng to disjoint stream
    hits = 0
    for q in 0..m_queries:
        key = generate distinctly-not-inserted key
        if filter.contains(key):
            hits += 1
    observed = hits / m_queries
    theoretical = (1 - exp(-k * n / m))^k
    print observed, theoretical

Generating known-absent keys: either stay in the inserted family but use indices ≥ n, or switch to a different prefix. If insert used "key0", "key1", ..., "key{n-1}", querying with "q0", "q1", ... works because the different prefix guarantees no accidental overlap.

A hundred thousand absent queries gives roughly ±10% noise (about three standard deviations) on a 1% FPR estimate; that's the sample size used in the test fpr_within_2x.
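Both the theoretical FPR and the sampling noise of a measurement can be computed directly. The Rust sketch below uses the formula from fpr-test and the binomial standard error; the function names are illustrative.

```rust
/// Theoretical FPR of a Bloom filter holding `n` keys in `m` bits with `k` probes:
/// (1 - exp(-k * n / m))^k.
pub fn theoretical_fpr(n: u64, m: u64, k: u32) -> f64 {
    let exponent = -(k as f64) * (n as f64) / (m as f64);
    (1.0 - exponent.exp()).powi(k as i32)
}

/// Relative one-sigma error of an FPR estimate taken from `queries`
/// independent absent-key lookups, when the true rate is `p`.
pub fn relative_std_error(p: f64, queries: u64) -> f64 {
    (p * (1.0 - p) / queries as f64).sqrt() / p
}
```

With p = 0.01 and 100 000 queries the one-sigma relative error is about 3%, so three sigma is a little under 10%. That leaves a wide margin before a correct implementation could trip a 2× bound.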

Test gate

with_fpr(1000, 0.01) returns k=7 and m ∈ [9 581, 9 591].
encode then decode gives an identical filter.
fpr-test with n=10 000, m_queries=100 000 reports observed FPR within 2× of theoretical (a correct implementation typically lands well within ±50%).
The encoded filter is byte-identical across Rust / Go / C++.

What broken looks like

k=0 from with_fpr → integer truncation; you used int(k_real) instead of round.
Decode fails on a perfectly valid file → endian mismatch or header offset wrong.
Observed FPR is exactly 1.0 → bit array got written but indices land outside its range (modulo bug).
Observed FPR is exactly 0.0 → contains always returns false; bit array isn't being touched on add (you forgot to mutate self).

LSM MemTable

Lab: db-05 — the in-memory write buffer of an LSM-tree.

1. What Is It

A MemTable is the in-memory, sorted write buffer at the top of every Log-Structured Merge tree (LSM). All writes — put, delete, range updates — land in the MemTable first, indexed by key, and only later get flushed to immutable on-disk SSTables (see db-06). It is paired with a Write-Ahead Log (db-03) for durability: WAL gives crash recovery; the MemTable gives fast point and range lookups.

This lab implements a deterministic, byte-identical MemTable across Rust, Go, and C++ that can be serialized to disk and read back in any of the three languages.

2. Why It Matters

Write throughput. Writes touch only RAM (plus a single sequential WAL append). Random puts become sequential disk traffic.
Read recency. The MemTable is the freshest copy of any key; a get must consult it first before falling through to L0..Ln SSTables.
Flush boundary. Once the MemTable hits its size cap (write_buffer_size in LevelDB/RocksDB), it freezes, a new MemTable rotates in, and the frozen one is written sequentially to an SSTable on background threads.
Tombstones. Deletes are inserts of tombstone records; the MemTable must preserve them through flush so older SSTables can be shadowed.

3. How It Works

                writes                      reads
                  │                           │
                  ▼                           ▼
   ┌──────── MemTable (active) ─────────┐  point/range
   │   sorted map: key → (type, value)  │◄──────────┐
   │   tombstones live alongside values │           │
   └──────────────────┬─────────────────┘           │
        size > cap?   │                              │
                      ▼                              │
       ┌── Immutable MemTable (frozen) ─┐            │
       │   flushing in the background    │◄───────────┤
       └──────────────────┬──────────────┘           │
                          ▼                          │
                   SSTable on disk ─────────────────►┘
                   (db-06 format)

Internally the MemTable is a sorted associative container with byte-lexicographic key ordering:

Rust: BTreeMap<Vec<u8>, Entry> (Vec<u8>'s Ord is lex over bytes).
Go: map[string]Entry + key slice sorted on dump/iteration.
C++: std::map<std::vector<uint8_t>, Entry> (operator< on vectors is lex).

Production LSMs (LevelDB, RocksDB) use a skip list because it offers concurrent lock-free reads and allocations from an arena. For this lab the simpler tree is fine — what matters is the order-determinism and the on-disk byte layout.

4. Core Terminology

Term	Definition
MemTable	Sorted in-memory map of keys to values/tombstones; the LSM write buffer.
Immutable MemTable	A frozen MemTable, no longer accepting writes, awaiting flush.
Tombstone	A delete marker stored as an entry of type `Delete`; needed because older SSTables may still hold the key.
Skip list	Randomized layered linked-list giving expected `O(log n)` insert/lookup; LevelDB/RocksDB's choice.
Flush	Writing a frozen MemTable out as an SSTable.
Sequence number	Monotonically increasing version tag attached to each write so readers can pick the right snapshot.
Arena	Bump allocator that backs MemTable nodes; freed in one go when the table is dropped.

5. Mental Models

Three-layer journal. WAL = durability log. MemTable = sorted index over the WAL's recent tail. SSTable = compacted, immutable snapshot. The MemTable is the short-term, queryable face of the WAL.
Latest write wins. For a single point lookup the MemTable always shadows any on-disk data; a tombstone in the MemTable hides every prior value of that key.
Flush is amplification's first knob. Larger MemTables → fewer, bigger L0 SSTables → less compaction work but more recovery time and RAM. Production tunes this between 16 MiB and 256 MiB.
Why sorted? Because the flush writes the SSTable in a single sequential pass — no on-disk sort needed if the MemTable is already ordered.

6. Common Misconceptions

"The MemTable is the WAL." No. The WAL is unsorted, append-only, and may contain redundant updates for the same key. The MemTable is sorted and deduplicated.
"Tombstones can be GC'd in the MemTable." No — they must be flushed; only after compaction confirms no older SSTable holds the key can a tombstone be dropped.
"You can skip the WAL if writes are batched." The MemTable lives in RAM. Without the WAL a crash loses every unflushed write.
"Skip list is the only valid structure." A B-tree, ART, or sorted vector with occasional rebuild are all viable; skip list wins for the specific concurrency pattern of one writer + many readers.

7. Interview Talking Points

Explain why an LSM uses a MemTable + WAL instead of writing directly to a sorted on-disk file (random I/O kills throughput).
Walk through the lifecycle: put → WAL append → MemTable insert → eventually frozen → flushed → compacted.
Describe how a get traverses MemTable → immutable MemTable → L0 SSTables → Ln, stopping at the first match (value or tombstone).
Cost of tombstones: a tombstone occupies space and must be carried through every flush and compaction until the levels that could still hold an older version have been merged; until then, reads and range scans pay to step over it.
Why a MemTable's flush is a sorted sequential write — and why this is the primary trick that makes LSMs faster than B-trees for write-heavy workloads.
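The read path from the third talking point can be sketched in Rust with each layer modelled as a sorted map, newest first. This is a deliberate simplification: real SSTables are files with indexes and Bloom filters, and `Layer`, `layered_get`, and `visible_value` are names invented for this example.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
pub enum Entry {
    Value(Vec<u8>),
    Tombstone,
}

/// One layer of the LSM: active MemTable, immutable MemTable, or an SSTable.
pub type Layer = BTreeMap<Vec<u8>, Entry>;

/// Search layers from newest to oldest and stop at the first hit,
/// whether that hit is a value or a tombstone.
pub fn layered_get<'a>(layers: &'a [Layer], key: &[u8]) -> Option<&'a Entry> {
    layers.iter().find_map(|layer| layer.get(key))
}

/// What a client sees: a tombstone reads the same as "never written".
pub fn visible_value<'a>(layers: &'a [Layer], key: &[u8]) -> Option<&'a [u8]> {
    match layered_get(layers, key) {
        Some(Entry::Value(v)) => Some(v.as_slice()),
        Some(Entry::Tombstone) | None => None,
    }
}
```

The key point is that a tombstone in a newer layer ends the search: the older value for the same key is never consulted, which is exactly how a delete shadows on-disk data.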

8. Connections to Other Labs

db-03 (WAL): every MemTable write is preceded by a WAL append; the WAL is replayed into a fresh MemTable on startup.
db-04 (Bloom filters): SSTables produced by MemTable flush carry Bloom filters for negative lookups.
db-06 (SSTable format): the flush target; this lab's flush_to is the producer side of db-06's open.
db-07 (compaction): consumes SSTables that came from MemTable flushes.
db-09 (LevelDB complete): stitches all of the above into a working KV store.

References — db-05 LSM MemTable

Primary sources

O'Neil, Cheng, Gawlick, O'Neil (1996). The Log-Structured Merge-Tree (LSM-Tree). Acta Informatica 33(4). The original paper. https://www.cs.umb.edu/~poneil/lsmtree.pdf
Sanjay Ghemawat & Jeff Dean. LevelDB. The reference open-source LSM. See db/memtable.{h,cc} and db/skiplist.h. https://github.com/google/leveldb
Facebook RocksDB Wiki — MemTable. https://github.com/facebook/rocksdb/wiki/MemTable Covers SkipList, HashSkipList, HashLinkList, and Vector MemTable factories.

Skip lists

William Pugh (1990). Skip Lists: A Probabilistic Alternative to Balanced Trees. CACM 33(6): 668–676. https://homepage.cs.uiowa.edu/~ghosh/skip.pdf
LevelDB's skiplist implementation — concurrent single-writer/many-reader, lock-free reads via memory-order acquire/release. https://github.com/google/leveldb/blob/main/db/skiplist.h

Alternative data structures

Bw-tree (Microsoft, 2013): lock-free B+ tree variant used in Hekaton.
Adaptive Radix Tree (ART, 2013): compact, cache-friendly trie used by HyPer and DuckDB. https://db.in.tum.de/~leis/papers/ART.pdf
Masstree (Mao, Kohler, Morris, 2012): trie-of-B+trees, very fast for variable length keys.

Tombstones and snapshot reads

RocksDB DeleteRange. Tombstones over key ranges, important for prefix deletes. https://github.com/facebook/rocksdb/wiki/DeleteRange
LevelDB sequence numbers. Each MemTable entry is internally tagged with a 56-bit sequence and 8-bit type byte; this lab simplifies to just the type byte. See db/dbformat.h kValueTypeForSeek.

Real-world tunings

Cassandra: uses Memtable with off-heap allocators; flushed to SSTables on size, time, or commit-log pressure.
HBase: MemStore per column family; configurable via hbase.hregion.memstore.flush.size.
InfluxDB IOx & TimescaleDB: apply LSM ideas to time-series, with time-bucketed MemTables.

Analysis — db-05 LSM MemTable

Problem

Implement the in-memory write buffer of an LSM-tree such that

it supports put, delete (tombstone insertion), get, and ordered iteration;
it can be serialized to disk in a deterministic byte layout shared by Rust, Go, and C++;
the same dump can be reloaded in any of the three languages;
iteration order is byte-lexicographic on keys.

Constraints

Keys are arbitrary byte sequences up to 2^32 − 1 bytes (u32 length prefix).
Values are arbitrary byte sequences up to 2^32 − 1 bytes; for tombstones the value length is 0.
Cross-language interop: the dump format must be identical byte-for-byte and the cross-test script asserts SHA-256 equality of the three dumps.
No allocator tricks: simplicity over LevelDB-style arena/skiplist; we use the standard sorted map in each language.
No concurrency in this lab: single-threaded API. Concurrency arrives in db-09.

Design decisions

Why a sorted associative container, not a skip list?

Production LSMs (LevelDB, RocksDB) use skip lists because they support concurrent lock-free reads and arena allocation. For this teaching lab those benefits are irrelevant — what matters is determinism, byte-identical serialization, and the fact that iteration is in key order so the flush is a sequential write. Any sorted container satisfies that:

Language	Container	Why
Rust	`BTreeMap<Vec<u8>, Entry>`	`Vec<u8>: Ord` is byte-lex; balanced.
Go	`map[string]Entry` + `sort.Strings(keys)`	Avoid third-party sorted maps.
C++	`std::map<std::vector<uint8_t>, Entry>`	RB-tree; `vector<uint8_t>::operator<` is lex.

All three give the same iteration order for identical input, which is what cross-test checks.

Tombstones as entries

A delete(k) replaces whatever entry k had with Entry::Tombstone. Crucially the key is not erased — the tombstone must propagate to the SSTable to shadow older on-disk versions of the key.
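A minimal Rust sketch of these semantics, using the `BTreeMap` container named above. It assumes that deleting a key the MemTable has never seen still records a tombstone, since an older SSTable may hold that key.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
pub enum Entry {
    Value(Vec<u8>),
    Tombstone,
}

#[derive(Default)]
pub struct MemTable {
    map: BTreeMap<Vec<u8>, Entry>,
}

impl MemTable {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn len(&self) -> usize {
        self.map.len()
    }

    /// Insert or overwrite; the latest write wins.
    pub fn put(&mut self, key: &[u8], value: &[u8]) {
        self.map.insert(key.to_vec(), Entry::Value(value.to_vec()));
    }

    /// Record a tombstone. The key stays in the map so the delete
    /// survives the flush and can shadow older on-disk versions.
    pub fn delete(&mut self, key: &[u8]) {
        self.map.insert(key.to_vec(), Entry::Tombstone);
    }

    /// `None` means "this MemTable knows nothing about the key";
    /// `Some(Tombstone)` means "deleted here, stop looking further down".
    pub fn get(&self, key: &[u8]) -> Option<&Entry> {
        self.map.get(key)
    }
}
```

Returning `Option<&Entry>` instead of `Option<&[u8]>` keeps the three outcomes (value, tombstone, absent) distinguishable, which the layered read path depends on.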

On-disk dump layout

   offset  size  field
   ------  ----  --------------------------------
        0     4  magic ASCII "MMT1"
        4     4  count: u32 LE (entry count)
        8     ?  entries, sorted by key ascending:
                   [ klen: u32 LE ]
                   [ vlen: u32 LE ]   (0 if tombstone)
                   [ type: u8     ]   (0 = Value, 1 = Tombstone)
                   [ key bytes    ]
                   [ value bytes  ]

All multi-byte integers little-endian; the file is self-delimiting via count and each entry's two length prefixes. No checksum at this layer — the WAL (db-03) and the SSTable (db-06) carry their own.

Size accounting

size_bytes() returns the on-disk dump size assuming the current contents flush immediately: 8 bytes header + per entry (9 + klen + vlen). This is what an LSM would compare against write_buffer_size.
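The layout and the size formula can be exercised together. Below is a Rust sketch written as free functions over a `BTreeMap` (an assumption made to keep the example self-contained; the lab exposes these as `MemTable` methods). `write_entry` is a helper name introduced here.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
pub enum Entry {
    Value(Vec<u8>),
    Tombstone,
}

/// Append one entry: klen, vlen, type, key bytes, value bytes.
/// Lengths are cast to u32; keys/values over 2^32 - 1 bytes are out of spec.
fn write_entry(out: &mut Vec<u8>, key: &[u8], value: &[u8], ty: u8) {
    out.extend_from_slice(&(key.len() as u32).to_le_bytes());
    out.extend_from_slice(&(value.len() as u32).to_le_bytes());
    out.push(ty);
    out.extend_from_slice(key);
    out.extend_from_slice(value);
}

/// 8-byte header plus (9 + klen + vlen) per entry; tombstones have vlen = 0.
pub fn size_bytes(map: &BTreeMap<Vec<u8>, Entry>) -> usize {
    let entries: usize = map
        .iter()
        .map(|(k, e)| {
            9 + k.len()
                + match e {
                    Entry::Value(v) => v.len(),
                    Entry::Tombstone => 0,
                }
        })
        .sum();
    8 + entries
}

/// Serialize in key order; BTreeMap iteration is already byte-lexicographic.
pub fn encode(map: &BTreeMap<Vec<u8>, Entry>) -> Vec<u8> {
    let mut out = Vec::with_capacity(size_bytes(map));
    out.extend_from_slice(b"MMT1");
    out.extend_from_slice(&(map.len() as u32).to_le_bytes());
    for (key, entry) in map {
        match entry {
            Entry::Value(v) => write_entry(&mut out, key, v, 0),
            Entry::Tombstone => write_entry(&mut out, key, &[], 1),
        }
    }
    out
}
```

Asserting `size_bytes(&map) == encode(&map).len()` in a test keeps the flush threshold honest as the format evolves.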

Error model

The decoder validates:

magic == MMT1,
enough bytes remain for each header field and the declared key/value spans,
type is 0 or 1,
tombstones have vlen == 0,
no trailing garbage,
keys appear in strictly ascending order.

A failure returns an explicit error (Error::* in Rust, error in Go, std::invalid_argument / std::runtime_error in C++) rather than panicking. The encoder cannot fail (no I/O at that layer).
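One way to structure a decoder that performs every check in the list is to consume the input through a small cursor helper. The Rust sketch below is illustrative: the `DecodeError` variants and the `take`/`take_u32` helpers are names chosen for this example, not the lab's actual `Error` type.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
pub enum Entry {
    Value(Vec<u8>),
    Tombstone,
}

#[derive(Debug, PartialEq)]
pub enum DecodeError {
    Truncated,
    BadMagic,
    BadType(u8),
    TombstoneWithValue,
    UnsortedKeys,
    TrailingBytes,
}

/// Split `n` bytes off the front of the cursor, or fail if too few remain.
fn take<'a>(buf: &mut &'a [u8], n: usize) -> Result<&'a [u8], DecodeError> {
    let whole: &'a [u8] = *buf;
    if whole.len() < n {
        return Err(DecodeError::Truncated);
    }
    let (head, tail) = whole.split_at(n);
    *buf = tail;
    Ok(head)
}

fn take_u32(buf: &mut &[u8]) -> Result<u32, DecodeError> {
    let mut raw = [0u8; 4];
    raw.copy_from_slice(take(buf, 4)?);
    Ok(u32::from_le_bytes(raw))
}

pub fn decode(bytes: &[u8]) -> Result<BTreeMap<Vec<u8>, Entry>, DecodeError> {
    let mut buf = bytes;
    if take(&mut buf, 4)? != &b"MMT1"[..] {
        return Err(DecodeError::BadMagic);
    }
    let count = take_u32(&mut buf)?;
    let mut map = BTreeMap::new();
    for _ in 0..count {
        let klen = take_u32(&mut buf)? as usize;
        let vlen = take_u32(&mut buf)? as usize;
        let ty = take(&mut buf, 1)?[0];
        let key = take(&mut buf, klen)?.to_vec();
        let value = take(&mut buf, vlen)?.to_vec();
        let entry = match (ty, vlen) {
            (0, _) => Entry::Value(value),
            (1, 0) => Entry::Tombstone,
            (1, _) => return Err(DecodeError::TombstoneWithValue),
            (t, _) => return Err(DecodeError::BadType(t)),
        };
        // Strictly ascending: the new key must exceed the largest key so far.
        if map.keys().next_back().map_or(false, |last| *last >= key) {
            return Err(DecodeError::UnsortedKeys);
        }
        map.insert(key, entry);
    }
    if !buf.is_empty() {
        return Err(DecodeError::TrailingBytes);
    }
    Ok(map)
}
```

Because every read goes through `take`, a dump truncated at any byte offset surfaces as `Truncated` rather than as an out-of-bounds panic, and nothing is pre-allocated from the untrusted `count` field.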

Trade-offs

No sequence numbers. Real LSMs append a 64-bit (seqno << 8) | type trailer to every internal key so MVCC snapshots can pick the right version. We collapse to "latest write wins" because db-13 reintroduces MVCC.
No range tombstones. Each delete shadows exactly one key. RocksDB-style range deletes are db-09 work.
No prefix bloom or compressed entries. The MemTable is in RAM; flushing to a proper SSTable (db-06) is where compression and block boundaries appear.
Allocation policy: Vec/vector/[]byte-per-entry, not an arena. Allocator pressure becomes interesting only at multi-million-key scales, which we exercise in db-22 benchmarking.

Alternatives considered

Skip list with arena (LevelDB style). Better concurrency, cache locality, and drop-the-whole-arena freeing — but the data structure complexity (random levels, acquire/release pointer ops) would dwarf the lab's pedagogical point.
Sorted vector with binary search. Lowest memory overhead, but every put is O(n) due to mid-vector insertion. Fine for tiny tables (<1 K entries), terrible beyond that.
HashMap with periodic sort. Fast inserts, but iteration is no longer cheap; every flush triggers a sort. Acceptable if flush is rare, painful otherwise.
B-epsilon tree. Batches writes inside internal nodes, blurring the line with LSM. Out of scope.

Execution — db-05 LSM MemTable

Library API (Rust shape; mirrored in Go and C++)

pub enum Entry { Value(Vec<u8>), Tombstone }

pub struct MemTable { /* sorted map */ }

impl MemTable {
    pub fn new() -> Self;
    pub fn len(&self) -> usize;
    pub fn size_bytes(&self) -> usize;
    pub fn put(&mut self, key: &[u8], value: &[u8]);
    pub fn delete(&mut self, key: &[u8]);
    pub fn get(&self, key: &[u8]) -> Option<&Entry>;
    pub fn iter(&self) -> impl Iterator<Item = (&[u8], &Entry)>;
    pub fn encode(&self) -> Vec<u8>;
    pub fn decode(bytes: &[u8]) -> Result<Self, Error>;
}

Go: func New() *MemTable, func (*MemTable) Put / Delete / Get / Iter / Encode, func Decode([]byte) (*MemTable, error). Iter yields a slice of (key, entry) pairs in sorted order.

C++: class MemTable with the same names; Iter() returns a const reference to the underlying std::map.

CLI: `memtable`

memtable <subcommand> [args]

Subcommands:
  new PATH                       create an empty MemTable at PATH
  put PATH KEY VALUE             open PATH, put, save
  del PATH KEY                   open PATH, delete (writes tombstone), save
  get PATH KEY                   print 'value: <hex>' | 'tombstone' | 'absent'
  iter PATH                      print one line per entry: TYPE KEY VALUE  (hex)
  bulk PATH N                    open or create PATH, insert key0..key{N-1}
                                 with values val0..val{N-1}, save
  size PATH                      print 'entries=N size_bytes=B'

Iter output format (deterministic, used by cross-test):

V <hex-key> <hex-value>
T <hex-key>

Hex is lowercase, no 0x prefix, no separators.
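Since the cross-test compares `iter` output byte for byte, the line formatter is worth pinning down early. A Rust sketch follows; `iter_line` and `hex` are illustrative names, and the handling of an empty value (a trailing space after the key) is an assumption the format description above does not settle.

```rust
pub enum Entry {
    Value(Vec<u8>),
    Tombstone,
}

/// Lowercase hex, two digits per byte, no prefix, no separators.
fn hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Format one entry as the CLI's `iter` subcommand prints it.
pub fn iter_line(key: &[u8], entry: &Entry) -> String {
    match entry {
        Entry::Value(v) => format!("V {} {}", hex(key), hex(v)),
        Entry::Tombstone => format!("T {}", hex(key)),
    }
}
```

For instance, key `alpha` with value `first-updated` formats as `V 616c706861 66697273742d75706461746564`, and a tombstone for `beta` as `T 62657461`.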

Build & test

Per language:

# Rust
cd src/rust && cargo test --release && cargo build --release

# Go
cd src/go && go test ./... && go build -o bin/memtable ./cmd/memtable

# C++
cd src/cpp && cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \
            && cmake --build build && ( cd build && ctest --output-on-failure )

Or run all at once:

bash scripts/verify.sh

Cross-language interop test

scripts/cross_test.sh:

Build all three binaries.
Drive each one through the same sequence of bulk 100, a handful of puts with overwrites, and a handful of dels.
SHA-256 each dump; assert all three match.
For each writer/reader pair, run iter and check the output is byte-identical.

If any pair differs, the script prints the failing combination and exits non-zero.

Manual playground

$ memtable new /tmp/m.bin
$ memtable put /tmp/m.bin alpha "first"
$ memtable put /tmp/m.bin beta  "second"
$ memtable put /tmp/m.bin alpha "first-updated"   # overwrite
$ memtable del /tmp/m.bin beta                    # tombstone
$ memtable iter /tmp/m.bin
V 616c706861 66697273742d75706461746564
T 62657461
$ memtable get /tmp/m.bin alpha
value: 66697273742d75706461746564
$ memtable get /tmp/m.bin beta
tombstone
$ memtable get /tmp/m.bin gamma
absent
$ memtable size /tmp/m.bin
entries=2 size_bytes=48

48 = 8 (header) + (9+5+13) + (9+4+0) = 8 + 27 + 13: two entries, one with key "alpha" and value "first-updated", one tombstone for key "beta".

What broken looks like

Symptom	Likely cause
Cross-test SHA mismatch on first byte set	Magic disagreement (must be ASCII `MMT1`).
Cross-test SHA mismatch mid-file	Endianness or `type` byte placement differs.
`iter` order differs across langs	Go's map iteration order; missed the `sort.Strings`.
`get` returns `absent` after `del`	`del` erased the key instead of storing a tombstone.
Decoder accepts trailing garbage	Forgot the "consumed all bytes" check.

Observation — db-05 LSM MemTable

Hex layout of a tiny dump

Three operations: put alpha first, put beta second, del beta.

hexdump -C m.bin

00000000  4d 4d 54 31 02 00 00 00  05 00 00 00 05 00 00 00  |MMT1............|
00000010  00 61 6c 70 68 61 66 69  72 73 74 04 00 00 00 00  |.alphafirst.....|
00000020  00 00 00 01 62 65 74 61                          |....beta|

Annotated:

Offset	Bytes	Field
0	`4d 4d 54 31`	magic ASCII `MMT1`
4	`02 00 00 00`	count = 2
8	`05 00 00 00`	klen = 5 (`alpha`)
12	`05 00 00 00`	vlen = 5 (`first`)
16	`00`	type = Value
17	`61 6c 70 68 61`	key bytes `alpha`
22	`66 69 72 73 74`	value bytes `first`
27	`04 00 00 00`	klen = 4 (`beta`)
31	`00 00 00 00`	vlen = 0 (tombstone)
35	`01`	type = Tombstone
36	`62 65 74 61`	key bytes `beta`

Total: 40 bytes; matches size_bytes() = 8 + (9+5+5) + (9+4+0) = 40.

Cross-language byte equality

scripts/cross_test.sh produces three files rust.bin, go.bin, cpp.bin. With the verify script in this lab:

$ shasum -a 256 *.bin
b67…  rust.bin
b67…  go.bin
b67…  cpp.bin

If any one of the three differs we either have endianness disagreement, key ordering disagreement, or someone wrote the type byte in a different position.

Memory layout intuition

key                          entry
"abc"  ──►  Entry::Value("..."  10 bytes)
"abd"  ──►  Entry::Tombstone
"abz"  ──►  Entry::Value(""      0 bytes)   # empty value is legal and ≠ tombstone
"zz"   ──►  Entry::Value("..."  4096 bytes)

Notes:

The key length is not stored alongside the in-memory entry — only at encode time.
An empty value ("") is a valid value, distinct from Tombstone. The type byte is what discriminates them.

`size_bytes()` table

For a MemTable with n entries of average key length k̄ and average value length v̄, with a fraction f being tombstones:

$$ \mathrm{size\_bytes}(n, \bar{k}, \bar{v}, f) = 8 + n \cdot (9 + \bar{k}) + n(1-f) \cdot \bar{v} $$

For a few representative workloads:

n	k̄	v̄	f	size_bytes
10 000	16	100	0	1,250,008
100 000	32	256	0.01	29,444,008
1 000 000	64	1024	0	1,097,000,008

(Compare to LevelDB's default write_buffer_size of 4 MiB or RocksDB's 64 MiB default: the 10 000-entry row, at about 1.25 MB, would not yet trigger a flush under either, while the 100 000-entry row is several LevelDB buffers' worth.)
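The estimate is a one-liner to evaluate for your own workload. A Rust sketch, with a hypothetical function name and parameters that mirror the symbols in the formula:

```rust
/// Estimated dump size for `n` entries with average key length `avg_key`,
/// average value length `avg_value`, and a fraction `tombstones` of deletes:
/// 8 + n * (9 + avg_key) + n * (1 - tombstones) * avg_value.
pub fn estimated_size_bytes(n: u64, avg_key: f64, avg_value: f64, tombstones: f64) -> f64 {
    let n = n as f64;
    8.0 + n * (9.0 + avg_key) + n * (1.0 - tombstones) * avg_value
}
```

Dividing the result by your configured write buffer size gives a rough count of flushes the workload would trigger.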

What an empty MemTable looks like

hexdump -C empty.bin
00000000  4d 4d 54 31 00 00 00 00                           |MMT1....|

8 bytes. size_bytes() returns 8. len() returns 0.

Iteration order corner cases

keys = ["", "\x00", "\x00\x00", "a", "ab", "b"]

Sorted byte-lex order:

""         (empty key — sorts first)
"\x00"
"\x00\x00"
"a"
"ab"
"b"

Empty keys are legal in this design (klen = 0). They are useful when the key is something like a single byte tag followed by an optional suffix.
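These corner cases make a good regression fixture. The Rust sketch below relies on `Vec<u8>`'s built-in ordering, which compares bytes as unsigned values and sorts a strict prefix before any longer key; the helper names are illustrative.

```rust
/// Sort keys into the order every MemTable implementation must iterate in.
pub fn sort_byte_lex(mut keys: Vec<Vec<u8>>) -> Vec<Vec<u8>> {
    keys.sort(); // Vec<u8>: Ord is lexicographic over unsigned bytes
    keys
}

/// True when each key is strictly greater than its predecessor,
/// i.e. sorted with no duplicates (the property the decoder enforces).
pub fn is_strictly_ascending(keys: &[Vec<u8>]) -> bool {
    keys.windows(2).all(|pair| pair[0] < pair[1])
}
```

Bytes at or above 0x80 are worth adding to the fixture as well: under unsigned comparison `[0x80]` sorts after `"z"`, whereas a port that compares signed chars would place it first.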

Verification — db-05 LSM MemTable

Unit tests (per language)

ID	Test name	What it asserts
V1	`empty_encode_decode`	`MemTable::new().encode()` → 8 bytes `MMT1\x00\x00\x00\x00`; decode round-trips to an empty table.
V2	`put_then_get`	After `put("k","v")`, `get("k")` returns `Value("v")`.
V3	`overwrite_replaces`	Two puts on the same key keep only the latest value; `len()` stays at 1.
V4	`delete_writes_tombstone`	After `put("k","v")` then `del("k")`, `get("k")` returns `Tombstone` (not `None`).
V5	`iter_byte_lex_order`	Insert keys in random order; iteration yields them sorted byte-lex (`""` first, `\x00` next, etc.).
V6	`encode_decode_round_trip`	Build a 50-entry table with a mix of values and tombstones; encode → decode → every entry matches and `len()` is preserved.
V7	`size_bytes_matches_encode`	For any table, `size_bytes()` == `encode().len()`.
V8	`decoder_rejects_bad_magic`	`decode(b"XXX1...")` returns `Err`.
V9	`decoder_rejects_truncation`	Truncate a valid dump at every byte boundary; decode must fail cleanly (no panic).
V10	`decoder_rejects_unsorted_keys`	Hand-craft a dump where keys go `["b","a"]`; decoder rejects.

Cross-language interop (`scripts/cross_test.sh`)

The same scripted scenario runs in each language:

new   → bulk 100 → put "key50" "REPLACED"
                → del "key10"
                → put "" "empty-key-value"
                → del "key99"
                → save

This produces dumps rust.bin, go.bin, cpp.bin. The script then:

SHA-256s all three dumps. All must match — this is the byte-identical gate.
3×3 reader matrix. Every reader (rust/go/cpp) runs iter on every writer's dump. The lines must be identical across all 9 combinations.
get spot-check. Each reader queries key50, key10, key99, "", and an absent key nonexistent; results must be value: 5245504c41434544 (REPLACED), tombstone, tombstone, value: 656d7074792d6b65792d76616c7565, absent respectively across all readers.

End-to-end verification (`scripts/verify.sh`)

bash scripts/verify.sh

Builds and tests all three languages, then runs the cross-test. Final line must be ALL GREEN.

Manual sanity checks

memtable new /tmp/m && wc -c /tmp/m → exactly 8 bytes.
memtable bulk /tmp/m 1000 && memtable size /tmp/m → size_bytes matches 8 + Σ (9 + len("keyN") + len("valN")) summed over N=0..999, which comes to 20,788 bytes.
Hexdump the first 16 bytes of any dump and confirm magic + count.
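
The expected size for the bulk check can be computed independently of any implementation. The sketch below assumes, as in the scripted scenario, that bulk N writes keys key0..key{N-1} with values val0..val{N-1}; the function name is hypothetical.

```rust
/// Expected dump size after `memtable bulk PATH n`, assuming keys
/// "key0".."key{n-1}" and values "val0".."val{n-1}" with no tombstones.
pub fn expected_bulk_size(n: u32) -> usize {
    let mut total = 8; // magic + count
    for i in 0..n {
        let key = format!("key{i}");
        let val = format!("val{i}");
        total += 9 + key.len() + val.len(); // 9-byte entry header + payload
    }
    total
}
```

Comparing this number against the size_bytes field printed by memtable size gives a check that does not depend on the encoder under test.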

What broken looks like

Symptom	Diagnostic
`decode` accepts `b"\x00\x00\x00\x00"` (no magic check)	Add magic test V8.
Two readers print different `iter` output for the same dump	Either type-byte misplaced, or one language is comparing by string instead of bytes (UTF-8 vs raw).
`len()` differs across langs after the same script	Go's map+sort path lost a duplicate; check overwrite path.
Dump grows monotonically after `del`	Tombstone path is creating a new entry under a different key; check key equality.
Random crash in C++ on `decode` of truncated input	Missing length check before `memcpy`; bounds-check every read.

Broader Ideas — db-05 LSM MemTable

The MemTable in this lab is intentionally minimal. Real systems extend it in many directions; this doc maps the design space.

Concurrency-friendly structures

Skip list (LevelDB, RocksDB). Single writer + many readers, lock-free reads via memory-order acquire/release. The dominant choice for LSM MemTables.
Bw-tree (Hekaton). Lock-free B+ tree using delta records and a mapping table; shines on multi-writer workloads.
ART (Adaptive Radix Tree). Cache-friendly trie; very fast point lookups, used by HyPer, DuckDB, and recent CockroachDB internals for some indexes.
Masstree. Trie-of-B+trees; outperforms skip list on long variable keys.

Arena allocation

LevelDB's MemTable allocates all skip-list nodes from a bump-arena. Freeing is O(1) (drop the arena). RocksDB has a configurable Arena and a ConcurrentArena for parallel writes. Real benefit: less fragmentation and one cache-line probe per allocation. Our lab uses standard allocators because the lesson is the data layout, not the allocator.

Sequence numbers & MVCC

Production LSMs such as LevelDB and RocksDB append a 64-bit trailer to every user key to form the internal key: a 56-bit sequence number packed with an 8-bit type byte. Snapshot reads pick the newest entry whose sequence number is ≤ the snapshot's sequence number. db-13 revisits this when we add MVCC; here we collapse to last-write-wins.

Range tombstones

A single tombstone shadows one key. RocksDB's DeleteRange tombstones cover a key range [start, end) and live in a separate auxiliary structure inside the MemTable, which reads and compactions consult through the RangeDelAggregator. This avoids exploding the MemTable size when bulk-deleting. Adding it would require:

A RangeTombstone struct: (start, end, seqno).
A second sorted container inside MemTable.
get consults both: a key shadowed by an overlapping range tombstone returns Tombstone even if it has a Value entry.
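
The three pieces above can be sketched as follows. This is an illustration under stated assumptions, not RocksDB's design: the names RangeMemTable, delete_range, and covering_seqno are hypothetical, the second container is a Vec kept sorted by start key, and a per-write sequence number decides whether a range tombstone is newer than a point entry (so a put issued after the range delete is visible again).

```rust
use std::collections::BTreeMap;

#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Entry {
    Value(Vec<u8>),
    Tombstone,
}

/// Covers keys in [start, end); `seqno` orders it against point writes.
pub struct RangeTombstone {
    pub start: Vec<u8>,
    pub end: Vec<u8>,
    pub seqno: u64,
}

#[derive(Default)]
pub struct RangeMemTable {
    points: BTreeMap<Vec<u8>, (u64, Entry)>, // key -> (seqno, entry)
    ranges: Vec<RangeTombstone>,             // kept sorted by `start`
    next_seqno: u64,
}

impl RangeMemTable {
    fn bump(&mut self) -> u64 {
        self.next_seqno += 1;
        self.next_seqno
    }

    pub fn put(&mut self, key: &[u8], value: &[u8]) {
        let seqno = self.bump();
        self.points.insert(key.to_vec(), (seqno, Entry::Value(value.to_vec())));
    }

    pub fn delete_range(&mut self, start: &[u8], end: &[u8]) {
        let seqno = self.bump();
        // Insert after every tombstone whose start <= the new start.
        let at = self.ranges.partition_point(|r| r.start.as_slice() <= start);
        self.ranges.insert(
            at,
            RangeTombstone { start: start.to_vec(), end: end.to_vec(), seqno },
        );
    }

    /// Sequence number of the newest range tombstone covering `key`, if any.
    fn covering_seqno(&self, key: &[u8]) -> Option<u64> {
        // Only tombstones with start <= key can cover it.
        let hi = self.ranges.partition_point(|r| r.start.as_slice() <= key);
        self.ranges[..hi]
            .iter()
            .filter(|r| key < r.end.as_slice())
            .map(|r| r.seqno)
            .max()
    }

    pub fn get(&self, key: &[u8]) -> Option<Entry> {
        match (self.points.get(key), self.covering_seqno(key)) {
            // A newer range tombstone shadows the point entry.
            (Some((seq, _)), Some(rseq)) if rseq > *seq => Some(Entry::Tombstone),
            (Some((_, entry)), _) => Some(entry.clone()),
            (None, Some(_)) => Some(Entry::Tombstone),
            (None, None) => None,
        }
    }
}
```

The linear filter over the prefix is fine for a handful of range tombstones; a production structure would fragment overlapping ranges so lookups stay logarithmic.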

Multiple MemTables (active + immutable list)

Production engines keep one active MemTable plus a list of immutable MemTables awaiting flush. Reads consult [active, ...immutables, L0, L1, ...] in order. Writers swap atomically (active → immutable + new empty active) when the size cap is hit. This decouples flush latency from write latency.
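
A single-threaded sketch of the active + immutable arrangement is shown below. The type MemTableSet, its methods, and the coarse byte accounting are assumptions for illustration; a real engine would perform the swap under a lock or with an atomic pointer, and a flush thread would drain the immutable list.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// key -> Some(value), or None for a tombstone.
type Table = BTreeMap<Vec<u8>, Option<Vec<u8>>>;

pub struct MemTableSet {
    active: Table,
    active_bytes: usize,
    immutables: Vec<Arc<Table>>, // newest first, awaiting flush
    cap_bytes: usize,
}

impl MemTableSet {
    pub fn new(cap_bytes: usize) -> Self {
        Self { active: Table::new(), active_bytes: 0, immutables: Vec::new(), cap_bytes }
    }

    pub fn put(&mut self, key: &[u8], value: &[u8]) {
        // Coarse accounting: counts every write, including overwrites.
        self.active_bytes += 9 + key.len() + value.len();
        self.active.insert(key.to_vec(), Some(value.to_vec()));
        if self.active_bytes >= self.cap_bytes {
            self.rotate();
        }
    }

    /// Freeze the active table and start a fresh, empty one.
    fn rotate(&mut self) {
        let frozen = std::mem::take(&mut self.active);
        self.immutables.insert(0, Arc::new(frozen));
        self.active_bytes = 0;
    }

    /// Newest table wins: active first, then immutables newest to oldest.
    pub fn get(&self, key: &[u8]) -> Option<Option<Vec<u8>>> {
        if let Some(e) = self.active.get(key) {
            return Some(e.clone());
        }
        self.immutables.iter().find_map(|t| t.get(key).cloned())
    }

    pub fn immutable_count(&self) -> usize {
        self.immutables.len()
    }
}
```

Wrapping frozen tables in Arc lets a flusher hold a reference while readers keep consulting the same table.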

Write amplification interplay

The MemTable size cap (write_buffer_size) is the first knob in the LSM write amplification dial:

Larger MemTable → fewer, bigger L0 SSTables → less L0 compaction → lower write amp but slower recovery and more RAM.
Smaller MemTable → more L0 SSTables → more compaction work → higher write amp but fast recovery.

RocksDB and Cassandra default in the range 64–256 MiB; LevelDB defaults to 4 MiB.

Persistent MemTables (PMEM)

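Intel Optane / CXL persistent memory blurs the WAL+MemTable boundary: if the MemTable itself lives in persistent memory, a separate WAL can become unnecessary. Papers such as NoveLSM (USENIX ATC 2018) and SLM-DB (FAST 2019) explore this; FloDB (EuroSys 2017) is related work on restructuring the in-memory component, though it targets DRAM rather than persistent memory.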

Encryption

Cassandra and RocksDB optionally encrypt at-rest data including the MemTable's flushed SSTables. The MemTable itself is in RAM and inherits process-memory protection. Encrypting in-memory pages requires hardware support (SGX, AMD SEV).

Compression of in-memory entries

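For very long values, an engine can compress values before they enter the MemTable (for example with LZ4 or ZSTD). This trades CPU for memory and is useful when RAM is the limit. Stock RocksDB applies compression per SSTable block rather than per MemTable entry, so treat this as a custom extension rather than a built-in option.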

Skip-list level distribution

Pugh's original skip list uses a geometric level distribution with p = 0.5 (max levels ≈ log₂ n). LevelDB sets max height = 12 and branching = 4 (p = 1/4); RocksDB's skip list defaults to the same 12 and 4. Lower branching → taller towers → more pointers per node (more memory) but fewer nodes scanned per level.
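
The height draw behind these numbers is short enough to sketch. The constants follow the LevelDB values quoted above (max height 12, branching 4); the XorShift64 generator is a stand-in chosen only so the example needs no external crate, and its seed must be non-zero.

```rust
/// Tiny xorshift64 generator (seed must be non-zero); illustration only.
pub struct XorShift64(pub u64);

impl XorShift64 {
    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

pub const MAX_HEIGHT: usize = 12;
pub const BRANCHING: u64 = 4;

/// Grow the tower by one level with probability 1/BRANCHING, capped at MAX_HEIGHT.
pub fn random_height(rng: &mut XorShift64) -> usize {
    let mut height = 1;
    // Use the upper bits, which are better mixed than the lowest ones.
    while height < MAX_HEIGHT && (rng.next_u64() >> 32) % BRANCHING == 0 {
        height += 1;
    }
    height
}
```

With branching 4, about three quarters of nodes stay at height 1, so the expected number of forward pointers per node is 1 / (1 - 1/4) ≈ 1.33.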

Adversarial concerns

Memory amplification via tombstones. A flood of deletes can make the MemTable hold many entries with no live data; eventually all tombstones must propagate to SSTables and may take generations of compaction to GC.
Skew-induced flush storms. A hot key prefix can keep one MemTable bucket pinned while others empty; with hash-partitioned MemTables (HashSkipList) this is pronounced.

Beyond LSM

B-epsilon trees (TokuDB / Percona) batch writes inside internal B+ tree nodes; no separate MemTable.
Anti-caching (H-Store, the research precursor of VoltDB) keeps the working set in memory and evicts cold rows to disk; inverts the LSM model.
WiscKey decouples keys (LSM) from values (separate log) to slash write amplification for large values.

Step 01 — Sorted map + Entry type

Build the in-memory MemTable: a sorted associative container from byte-key to an Entry that is either Value(bytes) or Tombstone. Implement put, delete, get, iter, and len / size_bytes in all three languages with the same iteration order (byte-lex).

Why this first

The MemTable's iteration order is the contract that the on-disk format and the SSTable flush both depend on. If three languages disagree on order, every later step falls apart. So this step is a one-language-after-the-other implementation of the same BTreeMap-equivalent, with a shared unit test that inserts a permutation and checks the order.

Entry type

// Rust
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Entry {
    Value(Vec<u8>),
    Tombstone,
}

// Go
type EntryType uint8
const (
    EntryValue EntryType = 0
    EntryTombstone EntryType = 1
)
type Entry struct {
    Type  EntryType
    Value []byte // empty if Tombstone
}

// C++
namespace dse::memtable {
    enum class EntryType : std::uint8_t { Value = 0, Tombstone = 1 };
    struct Entry {
        EntryType type;
        std::vector<std::uint8_t> value; // empty if Tombstone
    };
}

The type-byte numbering (0 = Value, 1 = Tombstone) is part of the on-disk contract — don't reorder it.

Container choice

// Rust — Vec<u8>'s Ord is byte-lex; BTreeMap iterates in key order
use std::collections::BTreeMap;
pub struct MemTable {
    map: BTreeMap<Vec<u8>, Entry>,
    bytes: usize,
}

// Go — unordered map; sort keys on iteration / encode
type MemTable struct {
    m     map[string]Entry
    bytes int
}

func (t *MemTable) sortedKeys() []string {
    keys := make([]string, 0, len(t.m))
    for k := range t.m {
        keys = append(keys, k)
    }
    sort.Strings(keys) // byte-lex on string is the same as on []byte
    return keys
}

// C++ — std::map's comparator is operator< on vector<uint8_t>, which is lex
class MemTable {
    std::map<std::vector<std::uint8_t>, Entry> map_;
    std::size_t bytes_ = 0;
};

put / delete

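pub fn put(&mut self, key: &[u8], value: &[u8]) {
    // Subtract the old entry's contribution (0 if the key is new) ...
    self.bytes -= self.entry_bytes(key);
    self.map.insert(key.to_vec(), Entry::Value(value.to_vec()));
    // ... then add the contribution of the entry now stored under the key.
    self.bytes += self.entry_bytes(key);
}

pub fn delete(&mut self, key: &[u8]) {
    self.bytes -= self.entry_bytes(key);
    // A delete stores a tombstone; it never removes the key from the map.
    self.map.insert(key.to_vec(), Entry::Tombstone);
    self.bytes += self.entry_bytes(key);
}

// Encoded size of the entry currently stored under `key`:
// 9 header bytes (klen u32 + vlen u32 + type u8) plus key and value bytes.
fn entry_bytes(&self, key: &[u8]) -> usize {
    match self.map.get(key) {
        None => 0,
        Some(Entry::Value(v)) => 9 + key.len() + v.len(),
        Some(Entry::Tombstone) => 9 + key.len(),
    }
}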

Go and C++ use the same accounting trick: subtract the old entry's contribution, update, add the new contribution.

iter

pub fn iter(&self) -> impl Iterator<Item = (&[u8], &Entry)> {
    self.map.iter().map(|(k, e)| (k.as_slice(), e))
}

func (t *MemTable) Iter() []KeyEntry {
    out := make([]KeyEntry, 0, len(t.m))
    for _, k := range t.sortedKeys() {
        out = append(out, KeyEntry{Key: []byte(k), Entry: t.m[k]})
    }
    return out
}

const std::map<std::vector<std::uint8_t>, Entry>& Iter() const noexcept {
    return map_;
}

Test — order determinism

Insert this permutation in each language and assert iteration yields the keys in the canonical sorted order:

inputs (insert order):  ["b", "a", "", "\x00\x00", "ab", "\x00"]
expected iter order:    ["", "\x00", "\x00\x00", "a", "ab", "b"]

This catches:

Go forgetting to sort.
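C++ keying the map on strings built from const char* (where '\0' ends the string and silently truncates binary keys); std::string itself can hold embedded NULs only if it is constructed with an explicit length.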
Anyone using a hash map.
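
The permutation test can be sketched in Rust as below. It uses a bare BTreeMap<Vec<u8>, ()> as a stand-in for the lab's MemTable, and the helper name iteration_order is hypothetical; the point is only that Vec<u8> keys iterate in byte-lex order regardless of insertion order.

```rust
use std::collections::BTreeMap;

/// Insert keys in the given order and return them in iteration order.
pub fn iteration_order(inserts: &[&[u8]]) -> Vec<Vec<u8>> {
    let mut map: BTreeMap<Vec<u8>, ()> = BTreeMap::new();
    for k in inserts {
        map.insert(k.to_vec(), ());
    }
    // BTreeMap yields keys in ascending Ord order, which for Vec<u8> is byte-lex.
    map.keys().cloned().collect()
}
```

The Go and C++ versions of the same test should feed the identical permutation and compare against the identical expected list.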

What to verify before moving on

put then get returns the value just written.
delete then get returns Tombstone (not absent).
Overwriting a key keeps len() at 1.
The permutation test above passes.
size_bytes() increases by exactly 9 + klen + vlen for each new key; an overwrite changes it only by the difference in value length, so it stays flat when the new value has the same length.

Step 02 — Encode / Decode the dump

Serialize the MemTable to a byte-identical on-disk layout and parse it back.

Layout (recap from analysis.md)

  magic   "MMT1"            (4 bytes)
  count   u32 LE            (4 bytes)
  repeat count times, in ascending key order:
      klen   u32 LE
      vlen   u32 LE         (0 for tombstone)
      type   u8             (0 = Value, 1 = Tombstone)
      key    klen bytes
      value  vlen bytes

Rust

pub fn encode(&self) -> Vec<u8> {
    let mut out = Vec::with_capacity(self.size_bytes());
    out.extend_from_slice(b"MMT1");
    out.extend_from_slice(&(self.map.len() as u32).to_le_bytes());
    // BTreeMap iteration is ascending by key, which is the on-disk order.
    for (k, e) in &self.map {
        let (vlen, t, v) = match e {
            Entry::Value(v) => (v.len() as u32, 0u8, v.as_slice()),
            Entry::Tombstone => (0, 1, &[][..]),
        };
        out.extend_from_slice(&(k.len() as u32).to_le_bytes());
        out.extend_from_slice(&vlen.to_le_bytes());
        out.push(t);
        out.extend_from_slice(k);
        out.extend_from_slice(v);
    }
    out
}

pub fn decode(bytes: &[u8]) -> Result<Self, Error> {
    if bytes.len() < 8 { return Err(Error::Short); }
    if &bytes[..4] != b"MMT1" { return Err(Error::BadMagic); }
    let count = u32::from_le_bytes(bytes[4..8].try_into().unwrap()) as usize;
    let mut p = 8usize; // read cursor
    let mut t = MemTable::new();
    let mut prev: Option<Vec<u8>> = None;
    for _ in 0..count {
        // Fixed 9-byte entry header: klen, vlen, type.
        if p + 9 > bytes.len() { return Err(Error::Short); }
        let klen = u32::from_le_bytes(bytes[p..p+4].try_into().unwrap()) as usize;
        let vlen = u32::from_le_bytes(bytes[p+4..p+8].try_into().unwrap()) as usize;
        let ty = bytes[p+8];
        p += 9;
        // klen and vlen come from u32 fields, so this sum cannot overflow a 64-bit usize.
        if p + klen + vlen > bytes.len() { return Err(Error::Short); }
        let key = bytes[p..p+klen].to_vec();
        p += klen;
        let val = bytes[p..p+vlen].to_vec();
        p += vlen;
        // Keys must be strictly ascending; this also rejects duplicates.
        if let Some(ref pk) = prev {
            if key.as_slice() <= pk.as_slice() { return Err(Error::Unsorted); }
        }
        let entry = match ty {
            0 => { Entry::Value(val) }
            1 => { if vlen != 0 { return Err(Error::BadTombstone); } Entry::Tombstone }
            _ => return Err(Error::BadType),
        };
        prev = Some(key.clone());
        // insert_raw is a helper not shown here; it is assumed to store the
        // entry and update the byte accounting.
        t.insert_raw(key, entry);
    }
    // Anything after the last entry is corruption.
    if p != bytes.len() { return Err(Error::Trailing); }
    Ok(t)
}

Go

func (t *MemTable) Encode() []byte {
    out := make([]byte, 0, t.SizeBytes())
    out = append(out, 'M', 'M', 'T', '1')
    out = binary.LittleEndian.AppendUint32(out, uint32(len(t.m)))
    for _, k := range t.sortedKeys() {
        e := t.m[k]
        out = binary.LittleEndian.AppendUint32(out, uint32(len(k)))
        out = binary.LittleEndian.AppendUint32(out, uint32(len(e.Value)))
        out = append(out, byte(e.Type))
        out = append(out, k...)
        out = append(out, e.Value...)
    }
    return out
}

Decode mirrors the Rust shape: read header, then loop reading klen, vlen, type, key, value, validating ascending key order and rejecting trailing bytes.

C++

std::vector<std::uint8_t> MemTable::Encode() const {
    std::vector<std::uint8_t> out;
    out.reserve(SizeBytes());
    static constexpr std::uint8_t magic[4] = {'M','M','T','1'};
    out.insert(out.end(), magic, magic + 4);
    PutU32LE(out, static_cast<std::uint32_t>(map_.size()));
    for (auto const& [k, e] : map_) {
        PutU32LE(out, static_cast<std::uint32_t>(k.size()));
        PutU32LE(out, static_cast<std::uint32_t>(e.value.size()));
        out.push_back(static_cast<std::uint8_t>(e.type));
        out.insert(out.end(), k.begin(), k.end());
        out.insert(out.end(), e.value.begin(), e.value.end());
    }
    return out;
}

What the decoder must reject

Input	Why it must fail
< 8 bytes	header truncated
magic `XXXX`	bad format
count says 5 but only 3 entries fit	truncated body
type byte `2`	unknown type
tombstone with `vlen != 0`	malformed
keys not strictly ascending	violates order invariant
trailing bytes after last entry	corruption
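
The rejection table maps directly onto a structural walk of the dump. The sketch below is a standalone validator, not the lab's decode: the function validate, the helper read_u32, and the trimmed Error enum are assumptions for illustration. It uses slice .get() and checked_add so that truncated or hostile input yields an error instead of a panic.

```rust
use std::convert::TryInto;

#[derive(Debug, PartialEq, Eq)]
pub enum Error { Short, BadMagic, BadType, BadTombstone, Unsorted, Trailing }

/// Read a little-endian u32 at offset `p`, failing cleanly on truncation.
fn read_u32(bytes: &[u8], p: usize) -> Result<usize, Error> {
    let end = p.checked_add(4).ok_or(Error::Short)?;
    let raw = bytes.get(p..end).ok_or(Error::Short)?;
    Ok(u32::from_le_bytes(raw.try_into().unwrap()) as usize)
}

/// Walk a dump and return its entry count, or the first violation found.
pub fn validate(bytes: &[u8]) -> Result<usize, Error> {
    if bytes.len() < 8 { return Err(Error::Short); }
    if &bytes[..4] != b"MMT1" { return Err(Error::BadMagic); }
    let count = read_u32(bytes, 4)?;
    let mut p = 8;
    let mut prev: Option<&[u8]> = None;
    for _ in 0..count {
        let klen = read_u32(bytes, p)?;
        let vlen = read_u32(bytes, p + 4)?;
        let ty = *bytes.get(p + 8).ok_or(Error::Short)?;
        p += 9;
        let key_end = p.checked_add(klen).ok_or(Error::Short)?;
        let val_end = key_end.checked_add(vlen).ok_or(Error::Short)?;
        let key = bytes.get(p..key_end).ok_or(Error::Short)?;
        if val_end > bytes.len() { return Err(Error::Short); }
        match ty {
            0 => {}
            1 if vlen == 0 => {}
            1 => return Err(Error::BadTombstone),
            _ => return Err(Error::BadType),
        }
        // Strictly ascending keys; equal keys are rejected too.
        if prev.map_or(false, |pk| key <= pk) { return Err(Error::Unsorted); }
        prev = Some(key);
        p = val_end;
    }
    if p != bytes.len() { return Err(Error::Trailing); }
    Ok(count)
}
```

Test V9 then reduces to a loop: for every cut in 0..dump.len(), validate(&dump[..cut]) must return an error.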

How to spot-check

After encoding a 2-entry MemTable (alpha=first, beta tombstoned):

xxd /tmp/m.bin | head
00000000: 4d4d 5431 0200 0000 0500 0000 0500 0000  MMT1............
00000010: 0061 6c70 6861 6669 7273 7404 0000 0000  .alphafirst.....
00000020: 0000 0001 6265 7461                      ....beta

Three things to verify by eye:

4d4d5431 = MMT1.
0200 0000 = count of 2 (LE).
The tombstone's vlen is 0000 0000 and the byte before its key is 01.

Test V6 — round-trip 50 entries

Mix puts and deletes:

let mut t = MemTable::new();
for i in 0..50 {
    let key = format!("key{i}");
    if i % 5 == 0 {
        // Every fifth key becomes a tombstone.
        t.delete(key.as_bytes());
    } else {
        t.put(key.as_bytes(), format!("val{i}").as_bytes());
    }
}
let encoded = t.encode();
let roundtrip = MemTable::decode(&encoded).unwrap();
assert_eq!(roundtrip.len(), t.len());
for (k, e) in t.iter() {
    assert_eq!(roundtrip.get(k), Some(e));
}
assert_eq!(roundtrip.encode(), encoded);

The last line is the idempotence check — decoding and re-encoding produces the same bytes. If it doesn't, we have non-determinism in iteration order, which will also break cross-language interop.

Step 03 — CLI + cross-language interop

Wrap the library in a uniform CLI and prove that all three implementations produce byte-identical dumps for the same scripted scenario.

The `memtable` binary

Every language exposes the same subcommands so the cross-test can drive them uniformly:

memtable new    PATH
memtable put    PATH KEY VALUE
memtable del    PATH KEY
memtable get    PATH KEY
memtable iter   PATH
memtable bulk   PATH N
memtable size   PATH

KEY and VALUE are passed as raw command-line strings. They may contain printable bytes; for testing we stick to ASCII to avoid shell quoting issues.
iter and get print hex (lowercase, no separator) so output is shell-safe.

Output formats

# iter
V <hex-key> <hex-value>
T <hex-key>

# get
value: <hex-value>
tombstone
absent

# size
entries=<N> size_bytes=<B>
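
The iter and get line formats are simple enough to pin down in a few helpers. This is a sketch under the format shown above; the function names to_hex, iter_line, and get_line are hypothetical, and a tombstone is modelled as None.

```rust
/// Lowercase hex with no separators, as `iter` and `get` print it.
pub fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

/// One `iter` line: `V <hex-key> <hex-value>` or `T <hex-key>`.
/// `value` is `None` for a tombstone.
pub fn iter_line(key: &[u8], value: Option<&[u8]>) -> String {
    match value {
        Some(v) => format!("V {} {}", to_hex(key), to_hex(v)),
        None => format!("T {}", to_hex(key)),
    }
}

/// One `get` line. Outer `None` means the key is absent;
/// `Some(None)` means a tombstone is stored under the key.
pub fn get_line(found: Option<Option<&[u8]>>) -> String {
    match found {
        Some(Some(v)) => format!("value: {}", to_hex(v)),
        Some(None) => "tombstone".to_string(),
        None => "absent".to_string(),
    }
}
```

Keeping these three functions byte-identical in behaviour across languages is what lets the cross-test diff reader outputs directly.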

The scripted scenario

scripts/cross_test.sh drives every language through this sequence:

new                                          # 8 bytes, empty
bulk 100                                     # 100 entries key0..key99 / val0..val99
put  key50  REPLACED                         # overwrite
del  key10                                   # tombstone
put  ""     empty-key-value                  # empty key as a valid key
del  key99                                   # tombstone at the tail

Then it dumps rust.bin, go.bin, cpp.bin and asserts:

shasum -a 256 rust.bin go.bin cpp.bin
# all three hashes must be identical

3×3 reader matrix

For every writer × reader combination, the script runs

$READER iter $WRITER.bin > out.${reader}.${writer}.txt

and diffs pairs of outputs. All nine outputs must agree byte-for-byte.

Why a `bulk` subcommand

Running 100 separate memtable put PATH key0 val0 … invocations would (a) thrash the disk and (b) test the CLI's argument parsing more than the data structure. bulk exists so the cross-test can build a non-trivial table in one process per language.

Spot-check `get` results

After the scenario the script also runs

get key50         # expect 'value: 5245504c41434544'   (REPLACED in hex)
get key10         # expect 'tombstone'
get key99         # expect 'tombstone'
get ""            # expect 'value: 656d7074792d6b65792d76616c7565'  (empty-key-value)
get nonexistent   # expect 'absent'

across all three readers.

Failure messages worth designing for

$ memtable get /tmp/m bogus
absent

$ memtable get /tmp/no-such-file foo
error: read /tmp/no-such-file: No such file or directory

$ memtable get /tmp/garbage.bin foo
error: bad magic

A consistent error vocabulary across languages makes the cross-test's grep patterns simpler.

Tying it together

scripts/verify.sh runs:

Rust tests (cargo test --release).
Go tests (go test ./...).
C++ tests (cmake -S . -B build && cmake --build build && ctest --test-dir build).
The cross-language script.

Final stdout must end with ALL GREEN.

SSTable Format

1. What Is It

A Sorted String Table (SSTable) is an immutable on-disk file holding key/value entries in byte-lex key order, organised into blocks of a roughly fixed target size, with an index block that maps each block's first key to its byte range inside the file, and a fixed-size footer that locates the index block.

The format in this lab:

+--------------------+   0
| data block 0       |
+--------------------+
| data block 1       |
+--------------------+
| ...                |
+--------------------+
| data block N-1     |
+--------------------+   index_offset
| index block        |
+--------------------+   file_size - 32
| footer (32 bytes)  |
+--------------------+   file_size

The footer always lives in the last 32 bytes and ends with the magic SST1\0\0\0\0, so any reader can validate the file with one pread of the tail and then pread the index block, and only then the relevant data block.

2. Why It Matters

Write-once, read-many. Each SSTable is written sequentially and then treated as read-only. That eliminates most concurrency hazards: lookups, range scans, and compactions can all share a single immutable file.
Bounded read amplification. A point lookup is footer → index → one data block. With a 4 KB block and a 16-byte average entry, ≤256 keys are scanned per lookup, regardless of file size.
Predictable I/O. Blocks are aligned write units; the OS page cache can pin hot blocks. Tail latency is dominated by exactly two I/Os per miss (index + data).
Foundation for LSM. Compaction merges multiple immutable SSTables into new immutable SSTables. The format is what makes "immutable + sorted + indexed" a usable storage primitive.

3. How It Works

3.1 Data block

A data block is a self-describing run of entries.

[count: u32 LE]
repeat count times (keys ascending byte-lex within the block):
  [klen: u32 LE][vlen: u32 LE][type: u8][key bytes][value bytes]

The writer flushes a block when its accumulated size would exceed a target (default 4096 bytes). The very first key of each block is the index key for that block.

3.2 Index block

[count: u32 LE]
repeat count times:
  [klen: u32 LE][offset: u64 LE][size: u64 LE][first-key bytes]

offset and size locate the data block inside the file. Index entries are listed in ascending block order, which is the same as ascending first-key order.

3.3 Footer

[index_offset: u64 LE]
[index_size:   u64 LE]
[num_blocks:   u64 LE]
[magic:        "SST1\0\0\0\0" (8 bytes)]

The fixed size makes the tail trivially locatable: pread(fd, buf, 32, file_size - 32).
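
A minimal sketch of the 32-byte footer as code follows. The struct, the method names, and the string error values are assumptions for illustration; the plausibility check (the index must end exactly where the footer begins) is derived from the file layout diagram rather than from a stated rule.

```rust
use std::convert::TryInto;

pub const FOOTER_LEN: usize = 32;
pub const MAGIC: &[u8; 8] = b"SST1\0\0\0\0";

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Footer {
    pub index_offset: u64,
    pub index_size: u64,
    pub num_blocks: u64,
}

impl Footer {
    pub fn encode(&self) -> [u8; FOOTER_LEN] {
        let mut out = [0u8; FOOTER_LEN];
        out[0..8].copy_from_slice(&self.index_offset.to_le_bytes());
        out[8..16].copy_from_slice(&self.index_size.to_le_bytes());
        out[16..24].copy_from_slice(&self.num_blocks.to_le_bytes());
        out[24..32].copy_from_slice(MAGIC);
        out
    }

    /// Parse the footer from the tail of a whole-file byte slice.
    pub fn decode_tail(file: &[u8]) -> Result<Footer, &'static str> {
        if file.len() < FOOTER_LEN {
            return Err("short");
        }
        let tail = &file[file.len() - FOOTER_LEN..];
        if &tail[24..32] != MAGIC {
            return Err("bad magic");
        }
        let u64_at = |i: usize| u64::from_le_bytes(tail[i..i + 8].try_into().unwrap());
        let footer = Footer {
            index_offset: u64_at(0),
            index_size: u64_at(8),
            num_blocks: u64_at(16),
        };
        // Plausibility: the index block must end exactly where the footer begins.
        let body = (file.len() - FOOTER_LEN) as u64;
        if footer.index_offset.checked_add(footer.index_size) != Some(body) {
            return Err("index out of range");
        }
        Ok(footer)
    }
}
```

With a file on disk, the same parse would run on the 32 bytes returned by a single tail read.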

3.4 Point lookup

Read footer; verify magic.
Read index block (index_offset..index_offset+index_size).
Binary-search the index for the rightmost index entry whose first key ≤ target. That entry's block is the only one that can contain the target.
Read that block; linear-scan within it.

A miss in step 3 (target < first entry's key) means the key is absent without reading any data block.
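
Step 3 is a floor search, not an equality search. A minimal sketch using the standard library's partition_point is shown below; the IndexEntry field names follow the index layout above, and the function name bsearch_floor is an assumption for illustration.

```rust
pub struct IndexEntry {
    pub key: Vec<u8>, // first key of the data block
    pub offset: u64,
    pub size: u64,
}

/// Rightmost index entry whose first key <= target, or None when the
/// target sorts before every block (a miss without any data-block read).
pub fn bsearch_floor<'a>(index: &'a [IndexEntry], target: &[u8]) -> Option<&'a IndexEntry> {
    // Number of entries whose first key is <= target.
    let n = index.partition_point(|e| e.key.as_slice() <= target);
    n.checked_sub(1).map(|i| &index[i])
}
```

Because index entries are in ascending first-key order, the predicate is true for a prefix of the slice, which is exactly what partition_point requires.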

4. Core Terminology

Term	Definition
SSTable	Immutable sorted on-disk K/V file.
Block	Contiguous run of entries inside the SSTable; the I/O and indexing unit.
Data block	Block containing user K/V entries.
Index block	Block mapping each data block's first key to `(offset, size)`.
Footer	Fixed-size tail (32 B) locating the index block; ends with a magic.
Magic	Sentinel byte pattern (`SST1` here) used to validate file identity.
Index key	The first key of a data block, copied into the index entry.
Block boundary	The byte position where one data block ends and the next begins.
Restart point	(Not used here; LevelDB-style intra-block delta-encoding marker.)
Tombstone	Entry whose `type=1` records that a key has been logically deleted.

5. Mental Models

Phone book. Data blocks are pages; the index is the alphabetical tabs on the side. The footer is the spine label saying "Volume 3 of 3".
Skiplist with one level. The index is a single coarse "level" above the sorted data; binary search on the index replaces a multi-level skiplist traversal.
Two-tier B+tree. Conceptually an SSTable is a 2-level B+tree whose leaves are data blocks and whose root is the index block — but built sequentially and frozen.

6. Common Misconceptions

"You need to scan the whole file to find a key." No — one index lookup pins a single block.
"The index must be at the start so you can read it first." No — the footer pointer makes the index location flexible, and writing the index last avoids buffering all keys before any data is flushed.
"Block size = block contents size." The on-disk block includes its own count header; the writer tracks an estimate of accumulated bytes so blocks land roughly on the target.
"Tombstones can be dropped at write time." Not safely — a tombstone must survive until no older SSTable can still hold a value for the key it shadows (handled by compaction in db-07).
"Binary search needs fixed-size index entries." The index entries here are variable-length, but the index block itself is small and fully loaded into RAM, where any search structure is cheap.

7. Interview Talking Points

Why is the footer at the end? ("So the writer can stream data blocks then index without two passes; one pread of the tail finds everything.")
What changes if you target 64 KB blocks instead of 4 KB? (Fewer index entries → smaller index → faster directory lookups; larger read amplification per miss; better compression ratios.)
How does this format become durable? (fsync after the footer is written, and a parent directory fsync so the dirent is recoverable. Without it a crash can leave the magic visible but data missing.)
What is bsearch looking for inside the index? (The floor — the largest first-key ≤ target — not equality.)
What stops a corrupt footer from poisoning the reader? (Magic check + length plausibility checks. Real systems add CRCs per block.)

8. Connections to Other Labs

db-05 MemTable supplies the sorted, in-memory K/V stream that this writer drains into blocks.
db-04 Bloom Filters can be attached per SSTable to skip the index lookup on negative queries (added in db-08 / db-09).
db-07 LSM Compaction consumes many SSTables and produces new ones using exactly this format.
db-08 Block Cache and Iterators caches the parsed data block rather than re-decoding on each lookup, and turns intra-block scans into iterators.
db-09 LevelDB Complete stitches MemTable + WAL + SSTable + compaction + bloom into a working engine.

References — SSTable Format

Papers

O'Neil, P., Cheng, E., Gawlick, D., O'Neil, E. "The Log-Structured Merge-Tree (LSM-Tree)." Acta Informatica 33(4), 1996. — Original description of the immutable run / multi-level merge architecture.
Chang, F. et al. "Bigtable: A Distributed Storage System for Structured Data." OSDI 2006. — Introduces the SSTable term and the blocks-plus-index layout this lab mirrors.

Open-source implementations

LevelDB table/format.h, table/table_builder.cc, table/block_builder.cc, table/block.cc — canonical reference for this lab. The data block format here is the LevelDB block format with restart-point compression removed for clarity.
RocksDB table/block_based/block_based_table_builder.cc — adds bloom-filter blocks, compression, and partitioned indices on top of the same skeleton.
CockroachDB Pebble sstable/writer.go and sstable/reader.go — Go implementation in idiomatic style.

Articles

LevelDB design doc: https://github.com/google/leveldb/blob/main/doc/table_format.md
RocksDB BlockBasedTable: https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format
"Building an LSM Storage Engine: SSTables" — Mini-LSM tutorial, Chen et al.: https://skyzh.github.io/mini-lsm/

Analysis — SSTable Format

Problem

Take a sorted stream of K/V (and tombstone) entries — exactly what db-05 produces — and persist it as an immutable, randomly-readable file:

writing is one sequential pass (no re-reads, no buffering all keys);
a point lookup costs O(log N) on the index plus one block read;
the file is self-describing: a reader can validate and navigate it using only the file itself.

Constraints

4 KB target data-block size (close to a page; tunable).
Little-endian integers throughout (matches db-03 / db-05).
No per-block CRCs in this lab — added in db-21 ("Storage Engine Advanced"). The footer magic is the only integrity gate.
No compression, no delta-encoded keys: the goal is a format simple enough to compare byte-for-byte across three languages.
Cross-language interop: Rust, Go, and C++ MUST emit byte-identical SSTables for the same input MemTable.

Design

Stream-once writer

sst_writer.add(key, type, value) -> writes into current data block buffer
sst_writer.finish() -> flushes the current block, writes index, writes footer

The writer accumulates one block in memory at a time. When adding an entry would push the encoded block size past 4096 bytes, the current block is flushed and a new one started. The first key of every block is captured into an IndexEntry { key, offset, size }.

Index sizing

Index entries are ~ (4 + 8 + 8 + k̄) = 20 + k̄ bytes. For k̄ = 16 and 4 KiB blocks, a 1 GiB SSTable carries ~262 144 blocks × 36 B ≈ 9 MiB of index, small enough to keep in RAM per open SSTable.

Lookup

fn get(key) -> Option<Entry>:
    footer = read_tail(32)
    assert footer.magic == "SST1\0\0\0\0"
    index = read(footer.index_offset, footer.index_size)
    blk = bsearch_floor(index, key)?                # None => below smallest
    block = read(blk.offset, blk.size)
    return linear_scan(block, key)

bsearch_floor is the rightmost index entry whose first key ≤ target. Returning None (target precedes the smallest first-key) is a fast miss without reading any data block.

Per-language container choice

Language	Writer buffer	Index repr
Rust	`Vec<u8>` for the current block	`Vec<IndexEntry>`
Go	`[]byte`	`[]IndexEntry`
C++	`std::vector<uint8_t>`	`std::vector<IndexEntry>`

IndexEntry is (Vec<u8> key, u64 offset, u64 size) in all three.

Build-from-memtable bridge

For cross-test friendliness, the writer's input source is a decoded MemTable dump (the output of db-05 encode). The CLI command build reads IN.mt, iterates in sorted order, and emits OUT.sst.

What could break

Block boundary drift. If two implementations disagree on when to flush a block (e.g. > 4096 vs >= 4096), the data blocks land at different offsets, the index and footer differ, and the file hashes differ. We pin the rule: flush when current_block_size + next_entry_size > 4096 AND current_block_size > 0.
Index encoding for the very first block. Its first key may be the empty string ""; the index entry then has klen=0. The reader must still treat it as the floor for any non-empty target.
Footer alignment. Anything other than exactly 32 bytes after the index block invalidates the magic offset.
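
The flush rule (flush when current_block_size + next_entry_size > 4096 AND current_block_size > 0) can be exercised in isolation. In the sketch below, split_into_blocks is a hypothetical helper that takes encoded entry sizes and returns how many entries land in each block. It assumes current_block_size counts only encoded entry bytes and excludes the 4-byte count header; the text does not say which, so treat that as an assumption every implementation must agree on.

```rust
pub const BLOCK_TARGET: usize = 4096;

/// Given encoded entry sizes (9 + klen + vlen each), return the number of
/// entries that end up in each data block under the pinned flush rule.
pub fn split_into_blocks(entry_sizes: &[usize], target: usize) -> Vec<usize> {
    let mut blocks = Vec::new();
    let mut cur_bytes = 0usize; // entry bytes in the current block (header excluded)
    let mut cur_entries = 0usize;
    for &size in entry_sizes {
        // Strictly greater than: an entry that lands exactly on the target stays.
        // The `cur_bytes > 0` guard keeps an oversized entry from flushing an empty block.
        if cur_bytes + size > target && cur_bytes > 0 {
            blocks.push(cur_entries);
            cur_bytes = 0;
            cur_entries = 0;
        }
        cur_bytes += size;
        cur_entries += 1;
    }
    if cur_entries > 0 {
        blocks.push(cur_entries); // final partial block, flushed by finish()
    }
    blocks
}
```

Two entries of 2048 bytes share one block under `>`, whereas a `>=` rule would split them, which is exactly the drift described above.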

Execution — SSTable Format

Library API (uniform across Rust / Go / C++)

struct Entry { type: 0|1; value: bytes }     // tombstone => value == empty
struct IndexEntry { key: bytes; offset: u64; size: u64 }
struct Footer { index_offset: u64; index_size: u64; num_blocks: u64; magic: "SST1\0\0\0\0" }

const BLOCK_TARGET: usize = 4096
const FOOTER_LEN:   usize = 32
const MAGIC:        &[u8; 8] = b"SST1\0\0\0\0"

// ---- writer ----
SstWriter::new(target_block_size = BLOCK_TARGET)
SstWriter::add(&mut self, key: &[u8], entry: Entry)   // keys MUST be strictly ascending
SstWriter::finish(&mut self) -> Vec<u8>               // returns full SSTable bytes

// ---- reader ----
SstReader::open(bytes: &[u8]) -> Result<Self, Error>
SstReader::len(&self) -> usize                         // num entries
SstReader::num_blocks(&self) -> usize
SstReader::get(&self, key: &[u8]) -> Option<Entry>     // None if absent; a tombstone is returned as Some(Entry), not skipped
SstReader::iter(&self) -> impl Iterator<Item=(&[u8], Entry)>  // full file scan

Error variants: Short, BadMagic, BadBlock, Unsorted, BadTombstone, BadType, IndexOutOfRange.

CLI

The binary is named sstable in every language and dispatches on the first arg:

sstable build  IN.mt OUT.sst         # read MemTable dump, write SSTable
sstable footer FILE.sst              # print: index_offset=... index_size=... num_blocks=... magic_ok=...
sstable get    FILE.sst KEY          # prints: value: <hex> | tombstone | absent
sstable iter   FILE.sst              # prints lines: V <hex-key> <hex-value> | T <hex-key>
sstable size   FILE.sst              # prints: file_bytes=B entries=N num_blocks=K

Output formats match db-05 deliberately so the same cross-test helpers (hex iter, value:/tombstone/absent get) apply.

Worked example

Given memtable bulk M.mt 100 && memtable put M.mt key50 REPLACED && memtable del M.mt key10, calling sstable build M.mt OUT.sst does:

Decode M.mt (MemTable format from db-05).
Iterate in sorted order; for each entry, call writer.add.
The writer accumulates entries into a 4096-byte data-block buffer. When the next entry would overflow, it flushes the buffer:
- records IndexEntry { key = first_key_of_block, offset, size },
- appends the encoded block to the output stream,
- resets the buffer, then adds the entry that would have overflowed as the first entry of the fresh block.
After the last add, finish flushes the final block, then writes the index block, then a 32-byte footer ending in SST1\0\0\0\0.

The output file is then self-validating: sstable footer OUT.sst prints the footer values, sstable iter OUT.sst reproduces every entry in sorted order, and sstable get OUT.sst key50 returns value: 5245504c41434544.

Observation — SSTable Format

Smallest possible SSTable

Build from an empty MemTable: zero entries, zero data blocks, an empty index, and just the footer.

file size: 0 + 4 (index count=0) + 32 (footer) = 36 bytes

Hex (annotated):

offset
0000:  00 00 00 00                                          # index block: count=0
0004:  00 00 00 00 00 00 00 00   index_offset = 0
000c:  04 00 00 00 00 00 00 00   index_size   = 4
0014:  00 00 00 00 00 00 00 00   num_blocks   = 0
001c:  53 53 54 31 00 00 00 00   magic        = "SST1\0\0\0\0"

File-size formula

For a build with N entries spread across K data blocks where the sum of key sizes is Σk and the sum of value sizes (only for non-tombstone entries) is Σv:

data_bytes  = Σ_blocks ( 4 + Σ_entries_in_block (9 + k + v) )
            = 4·K + N·9 + Σk + Σv
index_bytes = 4 + Σ_blocks ( 4 + 8 + 8 + first_key_len )
            = 4 + K·20 + Σ_block_first_key_lens
file_bytes  = data_bytes + index_bytes + 32

(The 4-byte per-block header is the entry count. The 20-byte per-index-entry overhead is klen u32 + offset u64 + size u64.)
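
The formula above can be checked mechanically. The following is a minimal sketch, not the lab's code: `predicted_file_bytes` is a hypothetical helper that takes the entries already grouped into blocks (a key plus `Some(value)` for a put or `None` for a tombstone) and applies the same terms as the formula.

```rust
const FOOTER_LEN: usize = 32;

/// Predicted SSTable size in bytes. Each inner Vec is one data block;
/// each entry is (key, Some(value)) for a put or (key, None) for a tombstone.
fn predicted_file_bytes(blocks: &[Vec<(&[u8], Option<&[u8]>)>]) -> usize {
    let mut data_bytes = 0;
    let mut index_bytes = 4; // index entry count (u32)
    for block in blocks {
        data_bytes += 4; // per-block entry count (u32)
        for (k, v) in block {
            // klen u32 + vlen u32 + type u8 + key + value (tombstones carry no value)
            data_bytes += 9 + k.len() + v.map_or(0, |v| v.len());
        }
        // klen u32 + offset u64 + size u64 + a copy of the block's first key
        index_bytes += 20 + block.first().map_or(0, |(k, _)| k.len());
    }
    data_bytes + index_bytes + FOOTER_LEN
}
```

With no blocks this evaluates to 4 + 32 = 36 bytes, the empty-SSTable size. A single put of a 1-byte key and 1-byte value gives 15 + 25 + 32 = 72 bytes.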

Hex walkthrough of a 3-entry SSTable

Three small entries that fit comfortably in a single block, e.g. put a 1, put bb 22, del ccc:

00000000  03 00 00 00                                     # count=3
00000004  01 00 00 00 01 00 00 00 00 'a' '1'              # entry 1: klen=1 vlen=1 type=0 "a" "1"
0000000f  02 00 00 00 02 00 00 00 00 'b' 'b' '2' '2'      # entry 2: klen=2 vlen=2 type=0 "bb" "22"
0000001c  03 00 00 00 00 00 00 00 01 'c' 'c' 'c'          # entry 3: klen=3 vlen=0 type=1 "ccc"

00000028  01 00 00 00                       # index count=1
0000002c  01 00 00 00 00 00 00 00 00 00 00 00 28 00 00 00 00 00 00 00 'a'   # klen=1 offset=0 size=0x28 "a"

00000041  28 00 00 00 00 00 00 00           # footer.index_offset = 0x28
00000049  19 00 00 00 00 00 00 00           # footer.index_size   = 0x19
00000051  01 00 00 00 00 00 00 00           # footer.num_blocks   = 1
00000059  53 53 54 31 00 00 00 00           # magic "SST1\0\0\0\0"
                                            #   (file size = 0x61 = 97 bytes)

Note that the first key of the single block is "a", so the index entry copies that key.

What broken looks like

Symptom	Likely cause
`BadMagic` at open	file truncated, or footer overwritten by an interrupted writer.
`BadBlock` reading a block	block size in the index disagrees with the in-file count header — e.g. wrong endianness.
Two languages produce different file sizes for identical input	block-flush rule mismatch (`>` vs `>=`).
`Unsorted` from the writer	caller didn't iterate the MemTable in sorted order before `add`.
`IndexOutOfRange` at read	corrupted `offset`/`size` in the index — checked against `file_len - 32` to fail loudly.

Verification — SSTable Format

V1: empty build

build from an empty MemTable produces a 36-byte file ending in SST1\0\0\0\0 with index_offset=0, index_size=4, num_blocks=0.

V2: single-entry build

add("k", Value("v")) → finish yields a file that:

contains exactly one data block,
has one index entry with key "k",
round-trips: iter returns [("k", Value("v"))], get("k") returns the value, get("missing") returns None.

V3: tombstones survive

A tombstone added during build is reported as Some(Entry::Tombstone) by get and as T <hex-key> by iter.

V4: ascending-key precondition

Calling add with a key that is not strictly greater than the previous added key MUST return Unsorted and leave no partial output.

V5: block-boundary rule

With target_block_size = 64 and inputs whose encoded sizes are known, the writer flushes the running block as soon as adding the next entry would push its size strictly greater than 64 bytes. A test inserts entries crafted so that the second insert is the boundary-crossing one and asserts the resulting file has exactly two data blocks and two index entries.

V6: footer invariants

For any file produced by finish:

the last 32 bytes parse as a Footer,
magic == "SST1\0\0\0\0",
index_offset + index_size + 32 == file_size,
num_blocks matches the count of index entries.
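
The four footer checks listed above can be expressed as one function. This is a sketch under assumptions: `check_footer` and its `String` errors are hypothetical (the lab's reader returns the `Error` variants instead), and the block count is read from the index header rather than by walking every index entry.

```rust
const FOOTER_LEN: usize = 32;
const MAGIC: &[u8; 8] = b"SST1\0\0\0\0";

fn read_u64_le(b: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(&b[..8]);
    u64::from_le_bytes(a)
}

/// Returns Ok(()) when the footer invariants hold, otherwise a description.
fn check_footer(file: &[u8]) -> Result<(), String> {
    if file.len() < FOOTER_LEN {
        return Err("file shorter than a footer".to_string());
    }
    let tail = &file[file.len() - FOOTER_LEN..];
    if tail[24..32] != MAGIC[..] {
        return Err("bad magic".to_string());
    }
    let index_offset = read_u64_le(&tail[0..8]);
    let index_size = read_u64_le(&tail[8..16]);
    let num_blocks = read_u64_le(&tail[16..24]);

    // index_offset + index_size + 32 == file_size, computed without overflow
    // so a corrupted footer cannot wrap around.
    let end = index_offset
        .checked_add(index_size)
        .and_then(|n| n.checked_add(FOOTER_LEN as u64));
    if end != Some(file.len() as u64) || index_size < 4 {
        return Err("index region does not end at the footer".to_string());
    }

    // The index block starts with its entry count (u32 LE).
    let p = index_offset as usize;
    let count = u32::from_le_bytes([file[p], file[p + 1], file[p + 2], file[p + 3]]) as u64;
    if count != num_blocks {
        return Err("num_blocks disagrees with the index entry count".to_string());
    }
    Ok(())
}
```

Running it on the 36-byte empty SSTable succeeds; flipping any magic byte or truncating the file makes it fail.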

V7: bulk round-trip vs MemTable

Take a MemTable populated by bulk 100 + put key50 REPLACED + del key10 + put "" empty-key-value + del key99, build an SSTable from it, then verify that iter-over-SSTable returns the exact same (key, entry) sequence as iter-over-MemTable. Per-key get agrees too.

V8: floor lookup correctness

For a multi-block SSTable, get(target) returns the matching entry when present. When the target falls in the gap between two blocks' key ranges, the floor search lands on the preceding block, the block scan finds nothing, and get returns None.
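
The floor step on its own can be sketched with the standard library's `partition_point`. `floor_block` is a hypothetical helper operating on the list of block first keys; it is an alternative formulation of the same search, not the lab's reader code.

```rust
/// Index of the block whose first key is the greatest one <= `key`,
/// or None when `key` sorts before every block's first key.
fn floor_block(first_keys: &[Vec<u8>], key: &[u8]) -> Option<usize> {
    // partition_point returns how many leading first keys are <= key
    // (first_keys is sorted, so the predicate is true for a prefix only).
    first_keys
        .partition_point(|k| k.as_slice() <= key)
        .checked_sub(1)
}
```

For first keys "a", "m", "t", a lookup of "b" lands on block 0, "m" on block 1, and a key below "a" yields None without touching any block.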

V9: reader rejects bad magic

A file with the last 8 bytes mutated away from SST1\0\0\0\0 MUST return BadMagic on open.

V10: reader rejects out-of-range index pointer

A file whose footer claims index_offset >= file_size - 32 MUST return IndexOutOfRange on open (caught before any block read).

Broader Ideas — SSTable Format

Restart points and prefix compression. LevelDB stores keys as (shared_prefix_len, unshared_suffix, value) and resets the prefix every N entries (a "restart point"). The block trailer lists the restart offsets so binary search inside a block is still O(log N_restarts), followed by a short linear scan from the chosen restart point. This can shrink on-disk key bytes substantially for sorted key sets with long shared prefixes, but it couples decode to encode order.
Two-level / partitioned index. A 1 TB SSTable would have ~10 GB of index entries (assuming 4 KB blocks and roughly 40 bytes per index entry). Partitioning the index into "index blocks of index blocks" keeps the resident index small at the cost of one extra pread per miss. RocksDB offers this as an opt-in partitioned index/filter format aimed at large SSTables.
Per-block bloom filters. Attaching a small Bloom (db-04) to each data block lets the reader skip the entire block on a miss without decoding it. Trades index/Bloom RAM for fewer block reads.
Block CRCs / per-block compression. Real engines write [data][type byte: compression][crc32c] per block; the writer computes the CRC over compressed bytes. Detects bit-rot at read time but adds CPU cost per block.
Streaming writer to disk. This lab returns Vec<u8>; production writers stream blocks into an os.File and only buffer the index in RAM. With a 4 KB block target and 250 entries/block, peak RAM is ~4 KB + the growing index.
Min/max keys per block in the index. Index entries can carry the last key too, so a query strictly between two blocks short-circuits without reading either. Costs ~2× index size.
Splitting hot blocks. An engine could measure read frequency per block and adaptively shrink hot blocks to reduce read amplification on small lookups. (CockroachDB applies a related idea at a coarser granularity with load-based range splitting.)
Versioned magic. A future format change (e.g. adding bloom blocks) bumps the magic to SST2; readers can keep both code paths and choose at open time. Cheap, common practice.
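
To make the restart-point idea from the list above concrete, here is a small sketch of the key side only. It is not LevelDB's on-disk encoding: `prefix_compress` and `shared_prefix_len` are hypothetical names, values and the restart-offset trailer are omitted, and the output is a list of (shared length, suffix) pairs rather than bytes.

```rust
/// Number of leading bytes that `a` and `b` have in common.
fn shared_prefix_len(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b.iter()).take_while(|(x, y)| x == y).count()
}

/// Encodes sorted keys as (shared_prefix_len, unshared_suffix) pairs.
/// Every `restart_interval`-th key is a restart point and is stored in full.
fn prefix_compress<'a>(keys: &[&'a [u8]], restart_interval: usize) -> Vec<(usize, &'a [u8])> {
    let interval = restart_interval.max(1); // guard against a zero interval
    let mut out = Vec::with_capacity(keys.len());
    for i in 0..keys.len() {
        let key: &'a [u8] = keys[i];
        let shared = if i % interval == 0 {
            0 // restart point: no dependency on the previous key
        } else {
            shared_prefix_len(keys[i - 1], key)
        };
        out.push((shared, &key[shared..]));
    }
    out
}
```

A decoder has to replay keys from the nearest restart point to rebuild a full key, which is the coupling between decode and encode order mentioned above.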

Step 01 — Data Block Writer

Goal

Implement the smallest unit of an SSTable: a data block builder that accumulates entries and emits the on-disk block bytes.

Block format (recap)

[count: u32 LE]
repeat count times (keys ascending within the block):
  [klen: u32 LE][vlen: u32 LE][type: u8][key][value]

The block does not carry its own size — the index entry that points to it does.

Encoded entry size

entry_size(klen, vlen) = 9 + klen + vlen   # 4 + 4 + 1 + key + value

A block that holds N entries occupies 4 + Σ entry_size.

Flush rule

Track current_size = 4 (the block header) and the buffer separately. For each candidate entry (k, v):

sz = entry_size(len(k), len(v))
if buffer_non_empty AND current_size + sz > BLOCK_TARGET:
    flush()                # emit block, capture index entry, reset
push entry
current_size += sz

This rule allows the block to grow up to and including BLOCK_TARGET bytes but never beyond. A single oversized entry is emitted alone in its own block (block size grows past the target only when one entry already exceeds it).
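
The rule can be exercised without any I/O by feeding it encoded entry sizes only. The sketch below is illustrative: `plan_blocks` is a hypothetical helper that returns the byte size of each block the rule would produce.

```rust
/// Encoded size of one entry: klen u32 + vlen u32 + type u8 + key + value.
fn entry_size(klen: usize, vlen: usize) -> usize {
    9 + klen + vlen
}

/// Applies the flush rule to a stream of encoded entry sizes and returns
/// the size in bytes of every data block that would be written.
fn plan_blocks(entry_sizes: &[usize], target: usize) -> Vec<usize> {
    const HEADER: usize = 4; // per-block entry count (u32)
    let mut blocks = Vec::new();
    let mut current = HEADER;
    let mut count = 0usize;
    for &sz in entry_sizes {
        // Flush only when the block is non-empty and the entry would push
        // it strictly past the target.
        if count > 0 && current + sz > target {
            blocks.push(current);
            current = HEADER;
            count = 0;
        }
        current += sz;
        count += 1;
    }
    if count > 0 {
        blocks.push(current); // final partial block
    }
    blocks
}
```

With a target of 64 and three 30-byte entries, the first two entries fill a block to exactly 64 bytes and the third starts a second block of 34 bytes. A single 100-byte entry still gets its own 104-byte block, which is the only way a block exceeds the target.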

Side-by-side: Rust / Go / C++

Rust

const HEADER: usize = 4; // entry-count header (u32 LE) at the start of every block
fn entry_size(k: usize, v: usize) -> usize { 9 + k + v }

struct BlockBuilder {
    buf: Vec<u8>,               // encoded block bytes, header included
    count: u32,                 // entries added so far
    first_key: Option<Vec<u8>>, // captured on the first add; copied into the index entry
}

impl BlockBuilder {
    fn new() -> Self {
        let mut buf = Vec::with_capacity(BLOCK_TARGET);
        buf.extend_from_slice(&0u32.to_le_bytes()); // placeholder for count
        Self { buf, count: 0, first_key: None }
    }
    fn current_size(&self) -> usize { self.buf.len() }
    fn add(&mut self, key: &[u8], ty: u8, value: &[u8]) {
        if self.count == 0 { self.first_key = Some(key.to_vec()); }
        self.buf.extend_from_slice(&(key.len() as u32).to_le_bytes());
        self.buf.extend_from_slice(&(value.len() as u32).to_le_bytes());
        self.buf.push(ty);
        self.buf.extend_from_slice(key);
        self.buf.extend_from_slice(value);
        self.count += 1;
    }
    fn finish(mut self) -> (Vec<u8>, Vec<u8>) {
        // Patch the real entry count over the placeholder.
        self.buf[0..HEADER].copy_from_slice(&self.count.to_le_bytes());
        (self.buf, self.first_key.unwrap_or_default())
    }
}

Go

type blockBuilder struct {
    buf      []byte
    count    uint32
    firstKey []byte
}

func newBlock() *blockBuilder {
    b := &blockBuilder{buf: make([]byte, 0, blockTarget)}
    b.buf = binary.LittleEndian.AppendUint32(b.buf, 0) // placeholder
    return b
}
func (b *blockBuilder) currentSize() int { return len(b.buf) }
func (b *blockBuilder) add(key []byte, ty byte, value []byte) {
    if b.count == 0 { b.firstKey = append([]byte(nil), key...) }
    b.buf = binary.LittleEndian.AppendUint32(b.buf, uint32(len(key)))
    b.buf = binary.LittleEndian.AppendUint32(b.buf, uint32(len(value)))
    b.buf = append(b.buf, ty)
    b.buf = append(b.buf, key...)
    b.buf = append(b.buf, value...)
    b.count++
}
func (b *blockBuilder) finish() (block, firstKey []byte) {
    binary.LittleEndian.PutUint32(b.buf[0:4], b.count)
    return b.buf, b.firstKey
}

C++

struct BlockBuilder {
    std::vector<uint8_t> buf;
    uint32_t count = 0;
    std::vector<uint8_t> first_key;

    BlockBuilder() {
        buf.reserve(kBlockTarget);
        put_u32_le(buf, 0);                // placeholder
    }
    size_t current_size() const { return buf.size(); }
    void add(const uint8_t* k, size_t klen,
             uint8_t ty,
             const uint8_t* v, size_t vlen) {
        if (count == 0) first_key.assign(k, k + klen);
        put_u32_le(buf, uint32_t(klen));
        put_u32_le(buf, uint32_t(vlen));
        buf.push_back(ty);
        buf.insert(buf.end(), k, k + klen);
        buf.insert(buf.end(), v, v + vlen);
        ++count;
    }
    std::pair<std::vector<uint8_t>, std::vector<uint8_t>> finish() {
        std::memcpy(buf.data(), &count, 4); // LE on the platforms we target
        return {std::move(buf), std::move(first_key)};
    }
};

(The memcpy shortcut shown here is correct only on little-endian hosts. For portability, the lab's real C++ implementation patches the count header with the same little-endian helper used for the other fields.)

Self-check

Empty finish returns (b"\x00\x00\x00\x00", b"").
After three adds the buffer length equals 4 + Σ entry_size.
first_key is captured on the first add and never overwritten.

Step 02 — SSTable Writer and Reader

Goal

Wire the data-block builder into a full SstWriter that emits [blocks...][index][footer].

Writer state

SstWriter {
    out: Vec<u8>,            // file bytes accumulated so far
    block: BlockBuilder,     // current data block
    index: Vec<IndexEntry>,  // one entry per flushed block
    target: usize,           // BLOCK_TARGET (default 4096)
    last_key: Option<Vec<u8>>,
}

add

fn add(&mut self, key: &[u8], ty: u8, value: &[u8]) -> Result<(), Error> {
    if let Some(prev) = &self.last_key {
        if key <= prev.as_slice() { return Err(Error::Unsorted); }
    }
    if ty == 1 && !value.is_empty() { return Err(Error::BadTombstone); }
    let sz = entry_size(key.len(), value.len());
    if self.block.count > 0 && self.block.current_size() + sz > self.target {
        self.flush_block();
    }
    self.block.add(key, ty, value);
    self.last_key = Some(key.to_vec());
    Ok(())
}

flush_block

fn flush_block(&mut self) {
    let blk = std::mem::replace(&mut self.block, BlockBuilder::new());
    let (bytes, first_key) = blk.finish();
    let offset = self.out.len() as u64;
    let size   = bytes.len() as u64;
    self.out.extend_from_slice(&bytes);
    self.index.push(IndexEntry { key: first_key, offset, size });
}

finish

fn finish(mut self) -> Vec<u8> {
    if self.block.count > 0 { self.flush_block(); }
    let index_offset = self.out.len() as u64;
    self.out.extend_from_slice(&(self.index.len() as u32).to_le_bytes());
    for e in &self.index {
        self.out.extend_from_slice(&(e.key.len() as u32).to_le_bytes());
        self.out.extend_from_slice(&e.offset.to_le_bytes());
        self.out.extend_from_slice(&e.size.to_le_bytes());
        self.out.extend_from_slice(&e.key);
    }
    let index_size = self.out.len() as u64 - index_offset;
    let num_blocks = self.index.len() as u64;
    self.out.extend_from_slice(&index_offset.to_le_bytes());
    self.out.extend_from_slice(&index_size.to_le_bytes());
    self.out.extend_from_slice(&num_blocks.to_le_bytes());
    self.out.extend_from_slice(b"SST1\0\0\0\0");
    debug_assert_eq!(self.out.len() as u64,
                     index_offset + index_size + FOOTER_LEN as u64);
    self.out
}

Footer parse

fn parse_footer(file: &[u8]) -> Result<Footer, Error> {
    if file.len() < FOOTER_LEN { return Err(Error::Short); }
    let tail = &file[file.len() - FOOTER_LEN..];
    if &tail[24..32] != b"SST1\0\0\0\0" { return Err(Error::BadMagic); }
    Ok(Footer {
        index_offset: u64::from_le_bytes(tail[ 0.. 8].try_into().unwrap()),
        index_size:   u64::from_le_bytes(tail[ 8..16].try_into().unwrap()),
        num_blocks:   u64::from_le_bytes(tail[16..24].try_into().unwrap()),
    })
}

The reader then verifies footer.index_offset + footer.index_size + 32 == file.len() (returns IndexOutOfRange otherwise) and parses the index block.

Index block parse

Identical structure to write:

let mut p = footer.index_offset as usize;
let count = read_u32_le(&file[p..]); p += 4;
let mut idx = Vec::with_capacity(count as usize);
for _ in 0..count {
    let klen = read_u32_le(&file[p..]) as usize;            p += 4;
    let off  = read_u64_le(&file[p..]);                     p += 8;
    let sz   = read_u64_le(&file[p..]);                     p += 8;
    let key  = file[p..p+klen].to_vec();                    p += klen;
    if off + sz > footer.index_offset {                     // beyond data region
        return Err(Error::IndexOutOfRange);
    }
    idx.push(IndexEntry { key, offset: off, size: sz });
}

get with floor binary search

fn get(&self, key: &[u8]) -> Option<Entry> {
    // Floor = rightmost index entry whose first_key <= key.
    let pos = match self.index.binary_search_by(|e| e.key.as_slice().cmp(key)) {
        Ok(i)  => i,                  // exact first-key match
        Err(0) => return None,        // key precedes the smallest block
        Err(i) => i - 1,
    };
    let blk = &self.index[pos];
    let block_bytes = &self.file[blk.offset as usize
                                .. (blk.offset + blk.size) as usize];
    scan_block(block_bytes, key)
}

scan_block decodes entries in order and returns the matching one when found, None otherwise (a hit on a tombstone returns Some(Entry::Tombstone) — the engine layer decides what tombstones mean).
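
A minimal sketch of what scan_block could look like, assuming the block layout from Step 01. The `Entry` enum here is a stand-in for the lab's type, and a malformed block simply yields None; the lab's reader would surface `BadBlock` instead.

```rust
#[derive(Debug, PartialEq)]
enum Entry {
    Value(Vec<u8>),
    Tombstone,
}

/// Reads a little-endian u32 from the start of `b`, if 4 bytes are available.
fn read_u32_le(b: &[u8]) -> Option<u32> {
    let s = b.get(..4)?;
    Some(u32::from_le_bytes([s[0], s[1], s[2], s[3]]))
}

/// Linear scan of one data block: [count][klen][vlen][type][key][value]...
/// Returns None when the key is absent (or, in this sketch, when the
/// block is truncated).
fn scan_block(block: &[u8], key: &[u8]) -> Option<Entry> {
    let count = read_u32_le(block)?;
    let mut p = 4usize;
    for _ in 0..count {
        let klen = read_u32_le(block.get(p..)?)? as usize;
        let vlen = read_u32_le(block.get(p + 4..)?)? as usize;
        let ty = *block.get(p + 8)?;
        let k = block.get(p + 9..p + 9 + klen)?;
        let v = block.get(p + 9 + klen..p + 9 + klen + vlen)?;
        if k == key {
            // A tombstone hit is reported, not hidden.
            return Some(if ty == 1 { Entry::Tombstone } else { Entry::Value(v.to_vec()) });
        }
        if k > key {
            return None; // keys ascend within a block, so the key cannot follow
        }
        p += 9 + klen + vlen;
    }
    None
}
```

The early exit on `k > key` is an optimization that relies on the ascending-key precondition enforced by the writer.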

Self-check

An empty writer: finish() length is exactly 36 (4-byte empty index + 32-byte footer).
After one add, file length = 4 + 9 + |k| + |v| (block) + 4 + 4 + 8 + 8 + |k| (index) + 32 (footer).
For a target of 64 and entries crafted with sizes 30, 30, 30: the first add fits (4+30=34 ≤ 64); the second also fits (34+30=64 ≤ 64, exactly at the target, so no flush); the third would overflow (64+30=94 > 64), so the writer flushes first. Result: two data blocks, holding two entries and one entry.

Step 03 — CLI and Cross-Language Test

CLI surface

sstable build  IN.mt OUT.sst        # MemTable dump in → SSTable out
sstable footer FILE.sst              # prints footer values + magic_ok
sstable get    FILE.sst KEY          # value: <hex> | tombstone | absent
sstable iter   FILE.sst              # V <hex-key> <hex-value> | T <hex-key>
sstable size   FILE.sst              # file_bytes=B entries=N num_blocks=K

The hex-encoding and value: / tombstone / absent strings match db-05 so the cross-test reuses the same comparison logic.

Cross-test scenario

Identical input across all three languages:

memtable new        M.mt
memtable bulk       M.mt 100
memtable put        M.mt key50 REPLACED
memtable del        M.mt key10
memtable put        M.mt ""     empty-key-value
memtable del        M.mt key99
sstable  build      M.mt OUT.sst

Cross-test checks:

Byte identity. sha256 of OUT.sst matches across rust / go / c++. (Same input MemTable dump + same writer rules ⇒ same bytes.)
3×3 iter matrix. Every reader can iterate every writer's output, producing identical line-by-line dumps.
3×3 footer parse. sstable footer OUT.sst from every reader on every writer's output reports the same index_offset / index_size / num_blocks and magic_ok=true.
Spot-check get. For each language: get key50 → value: 5245504c41434544, get key10 → tombstone, get "" → value: 656d7074792d6b65792d76616c7565, get nope → absent.
Iter equivalence vs MemTable. sstable iter OUT.sst matches memtable iter M.mt byte-for-byte (the SSTable preserves the sorted entry stream, including tombstones).

Block-boundary check

With 100 small entries (key0..key99 → val0..val99), each entry encodes to at most 9 + 5 + 5 = 19 bytes, so the whole data set is about 1.9 KB and fits in a single 4096-byte block. The few extra puts and deletes in the scenario do not change that. The cross-test still asserts only that num_blocks ≥ 1 and that every reader agrees on the count, so it stays valid if the input grows.

A separate sub-test forces a small block target (64 bytes) on identical input across the three languages and asserts the resulting num_blocks value matches; this is the precise boundary-rule check.

Output formats (exact strings)

Command	Format
`footer`	`index_offset=<N> index_size=<N> num_blocks=<N> magic_ok=<true\|false>`
`get`	`value: <hex>` \| `tombstone` \| `absent`
`iter` value	`V <hex-key> <hex-value>`
`iter` tombstone	`T <hex-key>`
`size`	`file_bytes=<N> entries=<N> num_blocks=<N>`

The cross-test scripts diff these as plain text.
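
Because the scripts diff plain text, the formatting code is worth pinning down. The sketch below shows one way to produce the `get` and `iter` strings from the table. The function names and the `Entry` enum are hypothetical, and lowercase hex is assumed because the examples in this document use it.

```rust
enum Entry {
    Value(Vec<u8>),
    Tombstone,
}

/// Lowercase hex, two digits per byte, no separators.
fn hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Output line for `sstable get`.
fn format_get(result: Option<&Entry>) -> String {
    match result {
        Some(Entry::Value(v)) => format!("value: {}", hex(v)),
        Some(Entry::Tombstone) => "tombstone".to_string(),
        None => "absent".to_string(),
    }
}

/// Output line for one entry of `sstable iter`.
fn format_iter_line(key: &[u8], entry: &Entry) -> String {
    match entry {
        Entry::Value(v) => format!("V {} {}", hex(key), hex(v)),
        Entry::Tombstone => format!("T {}", hex(key)),
    }
}
```

Formatting the value "REPLACED" through `format_get` yields `value: 5245504c41434544`, matching the spot-check for key50.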

db-07: LSM Compaction

0. Why compaction at all?

The LSM write path (db-05 MemTable + db-06 SSTable) is intentionally append-only. When a key is updated, the new version is written to a fresh MemTable and later flushed to a fresh SSTable; the old version is still sitting in some older SSTable. When a key is deleted, a tombstone is written, not a removal.

Without compaction, three pathologies grow without bound:

Pathology	Symptom	Bound
Read amplification	A single get() must check every live SSTable.	O(#SSTables)
Space amplification	Obsolete versions and tombstones keep occupying disk.	Total writes / live bytes
Index/metadata bloat	Reader has to load every SSTable's index.	O(#SSTables)

Compaction merges N input SSTables into M output SSTables, applying newest-wins semantics and (eventually) purging tombstones. It trades extra write I/O (write amplification) for bounded read and space amplification.

1. The two strategies (one sentence each)

Leveled (LevelDB, RocksDB default): level L holds at most ~10× the bytes of level L-1. When a level is full, you pick one file and compact it against the overlapping files in L+1. Read amp ≈ #levels; space amp ≈ 1.1×.
Tiered (Cassandra default, Pebble's "level 0"): when level L has K files, merge all of them into a single L+1 file. Read amp ≈ #levels × K; space amp can be 2–3×; write amp is much lower.

This lab implements neither policy. It implements the mechanism they both need: a correct K-way merge that respects recency ordering and tombstones. Picking the policy is a separate problem (and a configurable one).

2. The mechanism: K-way merge

Inputs: an ordered list of SSTables [A, B, C, ...], where A is the newest (most recently written) and the rest follow in age order.

Output: a single SSTable whose entries are the sorted union of all input keys, where for any key k the entry is taken from the first input that contains it. Tombstones are entries — they win against older values just like a put.

The merge is a textbook K-way merge:

Open all inputs and produce per-input cursors that iterate in key order.
Push each cursor's current key into a min-heap keyed by (key, source_index). source_index is the recency tiebreaker — smaller index = newer.
Pop the smallest. This is the next unique key and its winning entry.
Emit it (subject to the tombstone-drop rule below).
Advance the popped cursor. Also advance every other cursor whose current key equals the just-emitted key (they are stale duplicates).
Repeat until the heap is empty.

With a min-heap over K cursors and N total entries, the merge runs in O(N log K) time.

3. Newest-wins semantics

The contract:

Inputs are ordered by recency. Index 0 is newest.
For each distinct key, the first input that contains it wins.
The winning entry's type (Value or Tombstone) is preserved.
All other versions of that key are discarded.

This matches what a layered reader would do on a get() query if it walked the levels top-down and short-circuited on the first hit.

4. Tombstone purging

A tombstone exists to hide an older version of a key. It is safe to drop a tombstone if and only if there is no older version anywhere that the tombstone could be hiding.

Two cases:

Compacting the bottom level. There is nothing older. Every tombstone whose only remaining copy is in the output is safe to drop. Callers signal this with drop_tombstones=true.
Compacting a non-bottom level. Even if no input has an older version of the key, a deeper level still might. Tombstones must be kept. Callers leave drop_tombstones=false.

This lab exposes the flag and trusts the caller. A real engine wires it from the level metadata.
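
As an illustration of how a caller might derive the flag from level metadata, here is a deliberately coarse sketch. `drop_tombstones_for` is a hypothetical helper, not part of this lab or of any real engine. It assumes the compaction rewrites every file at the output level that could overlap its inputs, so older versions can only live at deeper levels. Engines such as LevelDB make this decision per key range rather than per level.

```rust
/// Decides the `drop_tombstones` flag for a compaction writing into
/// `output_level`. `files_per_level[i]` is the number of live SSTables at
/// level i, with level 0 the newest.
fn drop_tombstones_for(files_per_level: &[usize], output_level: usize) -> bool {
    // Safe only when every deeper level is empty: nothing older remains
    // that a tombstone could still be hiding.
    files_per_level
        .iter()
        .skip(output_level + 1)
        .all(|&n| n == 0)
}
```

With levels holding 4, 10, 0, 0 files, a compaction into level 1 may drop tombstones, while a compaction into level 0 may not.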

5. What this lab does NOT do (and why)

No splitting: the output is a single SSTable. Production engines cap output file size to keep per-file work bounded. The merge logic is the same; splitting is an output-side concern handled by switching SstWriter targets.
No level metadata: there is no notion of "this output belongs to level N". That belongs to a manifest / version-edit log, which is db-09 territory.
No deletion of obsolete inputs: the caller is responsible for unlinking the input files once the output is durable. We just return bytes.
No checksums or atomic rename: writing-then-renaming and checksumming blocks belong in db-08+.

6. Cross-language contract

The output is a db-06 SSTable. Two implementations that compact the same inputs in the same order with the same drop_tombstones flag must produce byte-identical outputs. We verify this with sha256 across rust/go/cpp.

7. Failure modes worth recognizing

Bug	Symptom
Wrong recency tiebreaker (older wins on ties)	After compaction, a recently-overwritten key reverts.
Forgetting to advance non-winning duplicates	Same key appears multiple times in output → SstWriter errors.
Comparing keys as strings (UTF-8) not bytes	Non-ASCII keys order wrong; cross-lang sha256 diverges.
Dropping tombstones when not at bottom	Deleted keys reappear from a deeper level.
Emitting an empty block instead of empty SSTable	File size ≠ 36 for empty merge; reader rejects.

8. Hand-trace template (the smallest interesting example)

Inputs (newest first):

A: [("a",V,"1"), ("b",T)]
B: [("a",V,"0"), ("c",V,"9")]

Step-by-step heap state and emit:

step	heap (key,src)	pop	emit	notes
0	(a,A) (a,B)	(a,A)	a → V "1"	also advance B past "a"
1	(b,A) (c,B)	(b,A)	b → T	tombstone preserved
2	(c,B)	(c,B)	c → V "9"	A is exhausted
3	(empty)	-	-	done

Output: [("a",V,"1"), ("b",T), ("c",V,"9")] — 3 distinct keys, A's versions of a and b win, c comes from B.

db-07 references

Foundational

O'Neil, P. et al. The Log-Structured Merge-Tree (LSM-Tree). Acta Informatica, 1996. The original. Read sections 3–4 for the merge/rolling-merge mechanism.
Chang, F. et al. Bigtable: A Distributed Storage System for Structured Data. OSDI 2006. Section 5.4 ("Compactions") frames minor vs. major compactions on top of SSTables.

Engineering, read these

LevelDB: https://github.com/google/leveldb/blob/main/db/version_set.cc See Compaction::IsBaseLevelForKey for the tombstone-purge rule and PickCompaction for the leveled-policy choice.
LevelDB design notes: https://github.com/google/leveldb/blob/main/doc/impl.md
RocksDB compaction overview: https://github.com/facebook/rocksdb/wiki/Compaction
Pebble (CockroachDB's Go LSM) compaction notes: https://github.com/cockroachdb/pebble/blob/master/docs/range_deletions.md Pebble's range-tombstone treatment is what you graduate to after this lab.
Universal (tiered) vs leveled, with numbers: https://github.com/facebook/rocksdb/wiki/Universal-Compaction

Curriculum companions

mini-LSM Chapter 2.6 "Compaction Strategies": https://skyzh.github.io/mini-lsm/week2-06-task-types.html
Designing Data-Intensive Applications, Ch. 3 — strikes the right level of abstraction for explaining write amp / read amp trade-offs.

Algorithm

K-way merge with a min-heap: any algorithms textbook. The pattern here is identical to "merge K sorted lists" with an extra rule for duplicate keys.

db-07 Analysis

Surface area

The lab exposes one library function and one CLI:

compact(inputs: ordered list of SSTable bytes, drop_tombstones: bool) -> SSTable bytes

inputs[0] is the newest. Empty input list returns an empty SSTable (36 bytes, identical to SstWriter::new().finish() from db-06).

CLI:

compact [--drop-tombstones] OUT.sst IN1.sst IN2.sst ...

State machine of the merge

The merger holds K cursors, one per input. Each cursor is a sequence of (key, entry) pairs in sorted key order, produced by iterating the input SSTable's blocks in order.

A min-heap holds at most K entries, each (current_key, source_index). source_index is the position in inputs (smaller = newer).

State transitions:

init:    push each non-empty cursor's first (key, src) into heap
step:    pop top (key=k, src=i)
         take entry from cursor i, advance cursor i
         for every other cursor j whose current key == k: advance cursor j
         re-push any cursor that still has items (only those that advanced past k)
         emit (k, entry) unless (entry is Tombstone AND drop_tombstones)
done:    when heap is empty

The "advance every cursor whose current key == k" rule is what makes the merge deduplicating. It is the only subtle bit. Forget it and SstWriter rejects the output with Unsorted because the same key reappears.

Containers per language

Rust: BinaryHeap<Reverse<(Vec<u8>, usize)>> — pop smallest by key, ties broken by source index (smaller = newer = wins). Cursors are IntoIter over pre-materialized Vec<(Vec<u8>, Entry)> from SstReader::entries().
Go: container/heap with a struct slice. Same ordering. Cursors are index counters into []Entry.
C++: std::priority_queue with custom comparator that flips to min-heap. Cursors are std::vector<...>::const_iterator pairs.

Materializing all entries up front is wasteful for huge SSTables but is fine for this lab and keeps the three implementations symmetric. A streaming reader is the next step (db-08 block-cache and iterators).

What's intentionally not optimized

We materialize entries instead of streaming blocks. This avoids needing a block-by-block iterator API on db-06's SstReader, which would couple the two labs more tightly than the curriculum wants at this stage.
We use a single output SSTable. Output splitting is one if-statement in the emit step (flush + start new SstWriter when size exceeds N). Doing it here would force a "list of outputs" API that doesn't matter for byte-identity.
We do not parallelize. A single K-way merge is inherently sequential; partitioning the key space across workers is a policy concern that belongs above this layer.

What could break the cross-language byte-identity

Tiebreaker inconsistency between heap implementations. Pin it: for two equal keys, the cursor with the smaller source index wins. All three implementations must agree on this exactly.
Comparing keys as language-native strings (UTF-8 ordering). All three must compare as byte slices (Vec<u8> / []byte / std::vector<uint8_t>).
Forgetting to advance non-winning duplicates. Output will contain repeats; SstWriter from db-06 will reject with Unsorted. Good — fail loud.
Different block-target sizes. We always use the db-06 default (4096) so the output is a single block for the canonical scenario.

Verification plan in one line

Build two distinct memtables (newer + older), promote each to an SSTable using db-06, run compact [newer, older] in all three languages, then assert sha256 equality and spot-check that newest-wins applied correctly.

db-07 Execution

Library API (per language, same shape)

fn compact(inputs: &[SstReader], drop_tombstones: bool) -> Vec<u8>

inputs[0] is newest.
Returns the bytes of a db-06 SSTable.
Empty inputs → 36-byte empty SSTable.

CLI

compact [--drop-tombstones] OUT.sst IN1.sst IN2.sst ...

IN1 is newest.
Output OUT.sst is byte-identical across rust/go/cpp for the same arguments.

Algorithm (pseudocode)

function compact(inputs, drop_tombstones):
  cursors = [iter(input) for input in inputs]   # each iter yields (key, entry) in key order
  heap = empty min-heap                          # entries: (key, src)
  for i, c in enumerate(cursors):
    k = c.peek()
    if k is not None: heap.push((k, i))

  out = SstWriter()
  while heap not empty:
    (k, i_win) = heap.pop()
    entry = cursors[i_win].next()                # consume winner
    if cursors[i_win].peek() is not None:
      heap.push((cursors[i_win].peek(), i_win))

    # Drain all older duplicates of the same key
    while heap not empty and heap.peek().key == k:
      (_, j) = heap.pop()
      cursors[j].next()
      if cursors[j].peek() is not None:
        heap.push((cursors[j].peek(), j))

    if entry.is_tombstone and drop_tombstones:
      continue
    out.add(k, entry)

  return out.finish()

Heap ordering: lexicographic on key; tiebreak by source index ascending (smaller index = newer = wins on equal keys).
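
The pseudocode translates fairly directly to Rust. The sketch below works on pre-materialized, sorted entry vectors and returns the merged entries instead of SSTable bytes, so it stays self-contained. `merge_runs`, `advance`, and this `Entry` enum are illustrative names, not the lab's API.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

#[derive(Clone, Debug, PartialEq)]
enum Entry {
    Value(Vec<u8>),
    Tombstone,
}

type Run = Vec<(Vec<u8>, Entry)>;
type Heap = BinaryHeap<Reverse<(Vec<u8>, usize)>>;

/// Moves cursor `i` forward and re-pushes it if it still has entries.
fn advance(inputs: &[Run], pos: &mut [usize], heap: &mut Heap, i: usize) {
    pos[i] += 1;
    if let Some((k, _)) = inputs[i].get(pos[i]) {
        heap.push(Reverse((k.clone(), i)));
    }
}

/// K-way merge with newest-wins semantics; `inputs[0]` is the newest run.
/// Each run must be sorted by key with no duplicate keys.
fn merge_runs(inputs: &[Run], drop_tombstones: bool) -> Run {
    let mut pos = vec![0usize; inputs.len()]; // cursor position per input
    let mut heap: Heap = BinaryHeap::new();
    for (i, run) in inputs.iter().enumerate() {
        if let Some((k, _)) = run.first() {
            // Reverse turns the max-heap into a min-heap on (key, source index).
            heap.push(Reverse((k.clone(), i)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((key, winner))) = heap.pop() {
        let entry = inputs[winner][pos[winner]].1.clone();
        advance(inputs, &mut pos, &mut heap, winner);
        // Drain stale duplicates of the same key from older inputs.
        while heap.peek().map_or(false, |Reverse((k, _))| *k == key) {
            let Reverse((_, j)) = heap.pop().unwrap();
            advance(inputs, &mut pos, &mut heap, j);
        }
        if drop_tombstones && entry == Entry::Tombstone {
            continue;
        }
        out.push((key, entry));
    }
    out
}
```

Tuple ordering gives the tiebreak for free: for equal keys, the smaller source index pops first, so the newer input wins. Feeding the result to SstWriter::add in order would then produce the compacted SSTable.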

How to wire it (per language)

Lang	Cursor	Heap
Rust	`std::vec::IntoIter<(Vec<u8>, Entry)>`	`BinaryHeap<Reverse<(Vec<u8>, usize)>>`
Go	index into `[]struct{Key,Entry}`	`container/heap` with `Less` honoring (key,src)
C++	pair of `vector<...>::const_iterator`	`priority_queue` with greater-than comparator

In all three languages, peeking a cursor's current key is just cursors[i].keys[idx_i] (or equivalent). There is no I/O during peek; entries are materialized once.

db-07 Observation

The canonical scenario

We build two SSTables (call them newer.sst and older.sst) and compact them in the order [newer, older].

newer.sst — produced from this MemTable scenario

memtable new
memtable bulk 50            # key0..key49 -> val0..val49
memtable put  "key10" "NEW-10"
memtable del  "key5"

So newer.sst contains 50 distinct keys, of which key10 has value "NEW-10", key5 is a tombstone, and the other 48 are val<i>.

older.sst — produced from this MemTable scenario

memtable new
memtable bulk 100           # key0..key99 -> val0..val99
memtable put  "key50" "OLD-50"

So older.sst contains 100 distinct keys, of which key50 is "OLD-50" and the others are val<i>.

Expected merged output

For every key the merge picks the first input that contains it:

Key range / specific key	Winner	Value
key0..key4	newer	val0..val4
key5	newer	Tombstone
key6..key9	newer	val6..val9
key10	newer	"NEW-10"
key11..key49	newer	val11..val49
key50	older	"OLD-50"
key51..key99	older	val51..val99

Total distinct keys: 100. Tombstones: 1 (key5). Values: 99.

"What broken looks like"

Bug	Symptom
Tiebreaker swapped (older wins)	key10 → "val10" instead of "NEW-10"; key5 → "val5" instead of tombstone.
Forget to drain duplicates	SstWriter::add returns Unsorted error (or "keys not strictly ascending").
Byte-vs-string comparison	On ASCII-only input the output sha256 still matches across languages; it diverges only once a key actually sorts differently under the two comparisons.
Tombstone dropped when `drop_tombstones=false`	Output has 99 keys instead of 100; key5 missing.
Tombstone kept when `drop_tombstones=true` at bottom	Output has 100 keys instead of 99; key5 still present as tombstone.

With `drop_tombstones=true`

Same inputs, run as bottom-level compaction:

key5 disappears entirely (newer's only entry for key5 was a tombstone).
99 keys total, all values.

Hex of the absolute simplest compaction

Compacting [A, B] where A = [("k", T)] and B = [("k", V, "v")]:

drop_tombstones=false: output is an SSTable with one entry ("k", T). File size = 4 (block hdr) + 4+4+1 (entry hdr) + 1 (key) + 0 (value) + 4 (index count) + 4+8+8+1 (one index entry) + 32 (footer) = 71 bytes. This is the same as sstable build of a MemTable containing only ("k", T).
drop_tombstones=true: output is the empty SSTable, exactly 36 bytes.

Cross-language sha256 must match for both cases.

db-07 Verification

Ten properties, three implementations each.

V1 — Empty inputs

compact([], drop=false) → exactly 36 bytes, identical to SstWriter::new().finish() from db-06. Same for drop=true.

V2 — Single input passthrough

compact([A], drop=false) reproduces A's logical contents (same entries in same order). The bytes are not necessarily identical to A (block boundaries may differ if A had unusual block-target settings), but the output's entries() matches A's entries() exactly.

V3 — Newest wins on overlap

Inputs A = [("k", V, "new")], B = [("k", V, "old")]. Output contains ("k", V, "new") only. Output entry count = 1.

V4 — Tombstones win over older values

Inputs A = [("k", T)], B = [("k", V, "v")]. With drop=false, output contains ("k", T). With drop=true, output is empty.

V5 — Disjoint keys interleave correctly

Inputs A = [("b", V, "x"), ("d", V, "x")], B = [("a", V, "y"), ("c", V, "y")]. Output: ("a", V, "y"), ("b", V, "x"), ("c", V, "y"), ("d", V, "x") — sorted, no duplicates, every entry from its sole source.

V6 — Three-way merge handles transitive dedupe

Inputs (newest → oldest):

A: [("k", V, "v1")]
B: [("k", V, "v2"), ("z", V, "Z")]
C: [("a", V, "A"), ("k", V, "v3")]

Output: [("a", V, "A"), ("k", V, "v1"), ("z", V, "Z")]. k resolves to A's value. Both B and C must advance past their "k" entries even though neither wins.

V7 — Canonical scenario byte-identity

Build newer.sst and older.sst as described in observation.md. Compact in each language with drop=false. Assert sha256 equality across all three languages.

V8 — SstWriter rejects an internally broken merge

If the merger forgets to drain duplicate cursors and tries to call SstWriter::add with the same key twice, the writer returns Error::Unsorted. The test for this constructs two inputs with overlapping keys and verifies that a correct compaction succeeds (i.e., we never see that error during a valid compaction).

V9 — Output is a valid db-06 SSTable

The bytes returned by compact open cleanly via SstReader::open and get(key) returns the merged version. This is the round-trip test.

V10 — Idempotent re-compaction

compact([compact([A, B])]) is byte-identical to compact([A, B]). Compaction of an already-compacted file is a no-op modulo metadata.

Cross-test (scripts/cross_test.sh)

Goes beyond V7 to also run a 3×3 reader/writer matrix on the merged file (byte-identity already implies this, but it confirms the output is portable):

Build newer.sst and older.sst via db-05 → db-06 (Rust binaries; db-06 already proved byte-identity).
Each language runs compact OUT.sst newer.sst older.sst.
Assert sha256 match across the three OUT files.
Each language reads each OUT file with sstable iter (from db-06) and asserts the iter output equals a reference (the Rust read of its own OUT).
Spot-check sstable get OUT.sst key10 → value: 4e45572d3130 ("NEW-10") in all three.
Spot-check sstable get OUT.sst key5 → tombstone.
Spot-check sstable get OUT.sst key50 → value: 4f4c442d3530 ("OLD-50").

db-07 Broader Ideas

What you'd add next, in order of payoff

Output splitting. Add compact_to_files(inputs, drop, target_bytes) -> Vec<Vec<u8>>. Implementation: switch SstWriter when the in-flight writer exceeds target_bytes. You must finalize at a key boundary (between two emitted entries), never inside a logical key, otherwise readers that depend on per-file key ranges will see overlaps.
Streaming block iterator on SstReader. db-06's entries() materializes everything; the compaction loop should pull one entry at a time per cursor. This is db-08 territory (block cache + iterators).
Range tombstones. A "delete all keys in [lo, hi)" record. Compaction has to track a set of active range tombstones during the merge and apply them to subsequent entries. Pebble's range-deletions doc is the reference.
Snapshot-aware tombstone purging. "Drop tombstones at bottom" becomes "drop tombstones older than the oldest live snapshot". Compaction takes a sequence-number floor and drops anything below it that has been superseded.
Leveled policy. A scheduler that picks N input files to compact based on per-level byte budgets and overlap. This is where VersionSet::PickCompaction and Compaction::IsBaseLevelForKey live in LevelDB.
Subcompactions. Splitting one logical compaction into K parallel ones by key range. Requires that the index of each input lets you cheaply find the byte range covering a key span — partitioned index helps.
Compaction throttling. When compaction can't keep up, foreground writes must stall. RocksDB exposes level0_slowdown_writes_trigger and level0_stop_writes_trigger. Without this, write bursts cause unbounded read amplification.
Universal/tiered compaction. A different scheduler; same merge mechanism. Worth implementing once leveled is in to feel the trade-off.
Per-key sequence numbers. Every key gets a monotonically-increasing seqnum; compaction picks the highest-seqnum entry for each key. This makes the merge correct under concurrent writes and snapshots. Required for MVCC (db-13).
Compaction filter callbacks. RocksDB lets the user inspect/transform every key during compaction (garbage collection of TTL'd values, schema migration). It's just a hook in the emit step.
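The first item, output splitting, can be sketched as follows. This is a minimal illustration under assumptions: kv and splitByTargetBytes are hypothetical names, and size is approximated as len(key)+len(value) where a real implementation would ask the in-flight SstWriter for its byte count. Because the merge emits exactly one entry per key, cutting between two emitted entries is always a key boundary.

```go
package main

import "fmt"

// kv is a merged output entry: one per key, already in ascending key order.
type kv struct {
	key, value string
}

// splitByTargetBytes groups entries into per-file runs. A new run starts once
// the current one has reached targetBytes. The cut always falls between two
// entries, so per-file key ranges never overlap.
func splitByTargetBytes(entries []kv, targetBytes int) [][]kv {
	var files [][]kv
	var cur []kv
	size := 0
	for _, e := range entries {
		if len(cur) > 0 && size >= targetBytes {
			files = append(files, cur) // finalize the in-flight file
			cur, size = nil, 0
		}
		cur = append(cur, e)
		size += len(e.key) + len(e.value) // approximation, see lead-in
	}
	if len(cur) > 0 {
		files = append(files, cur)
	}
	return files
}

func main() {
	in := []kv{{"a", "11"}, {"b", "22"}, {"c", "33"}, {"d", "44"}, {"e", "55"}}
	for i, f := range splitByTargetBytes(in, 6) {
		fmt.Printf("file %d: %s..%s (%d entries)\n", i, f[0].key, f[len(f)-1].key, len(f))
	}
}
```

With a 6-byte target and 3-byte entries, the five inputs land in three files covering a..b, c..d, and e..e.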

What this lab deliberately leaves un-clean for later

No async I/O. The merge is CPU-bound on materialized vectors.
No CRCs on blocks. Bad bytes in an input produce corrupt output silently.
No fsync / atomic rename. The CLI writes the output and the script renames.
No metrics. Production engines export bytes-read, bytes-written, files-in, files-out, duration per compaction.

These are intentional. The point of this lab is the merge, not the operational surface.

db-07 Step 1 — K-way merge core

The whole lab is one algorithm. We build it in three languages, then expose it through a tiny CLI.

Cursor

A cursor is an iterator over (key, entry) pairs from one input SSTable, in key order. The simplest representation: materialize via SstReader::entries() and index into the resulting vector.

struct Cursor {
    items: Vec<(Vec<u8>, Entry)>,
    pos: usize,
}
impl Cursor {
    /// Key at the current position, or None once the cursor is exhausted.
    fn peek(&self) -> Option<&[u8]> {
        self.items.get(self.pos).map(|(k, _)| k.as_slice())
    }
    /// Returns the current pair and advances. Panics if the cursor is exhausted.
    /// Cloning assumes Entry: Clone.
    fn take(&mut self) -> (Vec<u8>, Entry) {
        let i = self.pos;
        self.pos += 1;
        self.items[i].clone()
    }
}

type cursor struct {
    items []entry // entry = {Key []byte; E sstable.Entry}
    pos   int
}
func (c *cursor) peek() []byte { if c.pos >= len(c.items) { return nil }; return c.items[c.pos].Key }
func (c *cursor) take() entry  { x := c.items[c.pos]; c.pos++; return x }

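struct Cursor {
    std::vector<std::pair<std::vector<uint8_t>, sstable::Entry>> items;
    std::size_t pos = 0;
    // Key at the current position, or nullptr once the cursor is exhausted.
    const std::vector<uint8_t>* peek() const {
        return pos < items.size() ? &items[pos].first : nullptr;
    }
    // Moves out the current pair and advances; call only when peek() is non-null.
    std::pair<std::vector<uint8_t>, sstable::Entry> take() {
        return std::move(items[pos++]);
    }
};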

Heap entry

(key bytes, source_index)

Min-heap ordered by key ascending, ties broken by source_index ascending (smaller index = newer = wins). All three implementations must use this exact ordering for byte-identity.

Emit loop

Pseudocode is in docs/execution.md. The crucial bit is the inner drain loop:

# After emitting (k, entry):
while heap is not empty and heap.peek().key == k:
    (_, j) = heap.pop()
    cursors[j].take()  # discard the older duplicate
    if cursors[j].peek() is not None:
        heap.push((cursors[j].peek(), j))

That loop is the only difference between "K-way merge of disjoint inputs" and "K-way merge with newest-wins dedupe".
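Putting the cursor, the heap, and the drain loop together gives a complete merge. The following is a sketch rather than the lab's implementation: it uses string keys and values for brevity (Go compares strings bytewise), and kv, item, minHeap, and mergeNewestWins are hypothetical names.

```go
package main

import (
	"container/heap"
	"fmt"
)

type kv struct {
	key, value string
}

// item is a heap entry: a cursor's current key and its input index.
type item struct {
	key string
	src int
}

// minHeap orders by key ascending, then src ascending (newer first).
type minHeap []item

func (h minHeap) Len() int      { return len(h) }
func (h minHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h minHeap) Less(i, j int) bool {
	if h[i].key != h[j].key {
		return h[i].key < h[j].key
	}
	return h[i].src < h[j].src
}
func (h *minHeap) Push(x any) { *h = append(*h, x.(item)) }
func (h *minHeap) Pop() any {
	old := *h
	it := old[len(old)-1]
	*h = old[:len(old)-1]
	return it
}

// mergeNewestWins merges sorted inputs (index 0 = newest) into one sorted
// slice with exactly one entry per key.
func mergeNewestWins(inputs [][]kv) []kv {
	pos := make([]int, len(inputs)) // per-input cursor position
	h := &minHeap{}
	for src, in := range inputs {
		if len(in) > 0 {
			heap.Push(h, item{in[0].key, src})
		}
	}
	// advance moves cursor src forward and re-pushes its next key, if any.
	advance := func(src int) {
		pos[src]++
		if pos[src] < len(inputs[src]) {
			heap.Push(h, item{inputs[src][pos[src]].key, src})
		}
	}
	var out []kv
	for h.Len() > 0 {
		win := heap.Pop(h).(item)
		out = append(out, inputs[win.src][pos[win.src]])
		advance(win.src)
		// Drain loop: discard older duplicates of the key just emitted.
		for h.Len() > 0 && (*h)[0].key == win.key {
			dup := heap.Pop(h).(item)
			advance(dup.src)
		}
	}
	return out
}

func main() {
	a := []kv{{"a", "1"}, {"c", "3"}}
	b := []kv{{"a", "old"}, {"b", "2"}}
	fmt.Println(mergeNewestWins([][]kv{a, b}))
	// prints [{a 1} {b 2} {c 3}]
}
```

Removing the inner for loop turns this back into a plain K-way merge, which would emit "a" twice on the inputs in main.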

Why the tiebreaker direction matters

Inputs are passed newest first (index 0 newest). On a tie, the smaller index must come out of the heap first. So when you build a (key, src) tuple, the smaller src is the smaller tuple, and a min-heap pops it first. No need to invert; the natural lexicographic order on the tuple does the right thing.

If you ever flip the input convention (oldest first), invert the tiebreaker to match. Do not support both conventions; pick one and document it. We picked: index 0 = newest.

Try this before reading step 2

Without looking at the implementation, on paper, trace this:

A: [("a",V,"1"), ("c",V,"3")]
B: [("a",V,"old"), ("b",V,"2")]

Write out the heap after each pop. You should get four pops and three emits.

The expected output: [("a",V,"1"), ("b",V,"2"), ("c",V,"3")].

db-07 Step 2 — Tombstone purging and the bottom-level rule

A tombstone in an SSTable says: "this key was deleted; do not return any older version of it". Tombstones cost space and slow down reads (you still have to walk past them). Eventually you want to drop them.

The rule

A tombstone for key k is safe to drop during a compaction if and only if there is no older version of k anywhere in the database that the tombstone could be hiding.

In practice, a sufficient condition: if this compaction covers the bottom-most level and the tombstone's input is part of it, you can drop the tombstone.

For non-bottom compactions, keep all tombstones. A deeper level still has data the tombstone is suppressing.

API

compact(inputs, drop_tombstones: bool) -> bytes

The flag is the caller's promise. We do not verify it; we trust it. In a real engine, the scheduler sets drop_tombstones = (target_level == bottom).

Implementation: one if-statement

In the emit loop, after picking the winner (k, entry):

if entry.type == Tombstone and drop_tombstones:
    continue   # skip; do not write to output
out.add(k, entry)

That's the entire change versus the basic merge.
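In Go, the same if-statement looks like this. The sketch assumes the merge has already picked one winner per key; entry and emit are hypothetical names, and the tombstone field stands in for whatever tag the lab's Entry type uses.

```go
package main

import "fmt"

// entry is a simplified winner record (an assumption, not the lab's type).
type entry struct {
	key       string
	value     string
	tombstone bool
}

// emit applies the drop rule to winners the merge has already chosen.
func emit(winners []entry, dropTombstones bool) []entry {
	var out []entry
	for _, e := range winners {
		if e.tombstone && dropTombstones {
			continue // caller promised nothing older is left to hide
		}
		out = append(out, e)
	}
	return out
}

func main() {
	winners := []entry{
		{key: "a", value: "new"},
		{key: "b", tombstone: true},
		{key: "c", value: "v"},
	}
	fmt.Println(len(emit(winners, false)), len(emit(winners, true)))
	// prints 3 2
}
```

Note that the filter runs after winner selection: a dropped tombstone has still shadowed every older copy of its key, because the drain loop already discarded them.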

What's still wrong (and why it's OK for this lab)

The "drop tombstones at bottom" rule is a snapshot-unaware simplification. A correct engine keeps a tombstone alive until every read snapshot older than the tombstone's sequence number has been released. Implementing that requires per-entry sequence numbers, which we add in db-13.

For this lab, "bottom" means "the caller swears nothing older exists". That is enough to demonstrate the mechanism and to write a meaningful cross-test.

Test scenarios this enables

Test	Inputs (newest first)	drop	Expected output
Tombstone wins over older value	A=[(k,T)], B=[(k,V,"x")]	false	[(k,T)]
Tombstone dropped at bottom	same	true	[] (empty SSTable, 36 bytes)
Tombstone for non-existent key kept	A=[(k,T)]	false	[(k,T)]
Tombstone for non-existent dropped	same	true	[]
Mixed values + tombstones, mid-level	A=[(a,V),(b,T)], B=[(a,V_old),(c,V)]	false	[(a,A.V),(b,T),(c,V)]
Same inputs at bottom	same	true	[(a,A.V),(c,V)]

These are V4 in the verification table and the drop_tombstones=true arm of the cross-test.

db-07 Step 3 — CLI and cross-language byte-identity

CLI shape (all three languages emit and accept the same)

compact [--drop-tombstones] OUT.sst IN1.sst IN2.sst ...

Arguments:

--drop-tombstones: optional first flag. If present, tombstones are dropped (use when this is the bottom-level compaction).
OUT.sst: output file path.
IN1.sst ...: one or more input SSTable paths. IN1 is the newest.

Exit codes:

0: success.
1: any error (open failure, malformed SSTable, write failure).
2: usage error.
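The argument shape and the usage exit code can be sketched as follows. This is an illustration, not the lab's CLI: options and parseArgs are hypothetical names, and the sketch only parses; the real tool would then run the compaction and return 1 on any open, parse, or write failure.

```go
package main

import "fmt"

// options is a hypothetical parsed form of the compact command line.
type options struct {
	dropTombstones bool
	out            string
	inputs         []string // inputs[0] is the newest
}

// parseArgs takes the arguments after the program name and returns the
// parsed options plus an exit code: 0 if usable, 2 for a usage error.
func parseArgs(args []string) (options, int) {
	var opts options
	if len(args) > 0 && args[0] == "--drop-tombstones" {
		opts.dropTombstones = true
		args = args[1:]
	}
	if len(args) < 2 { // need OUT.sst plus at least one input
		return options{}, 2
	}
	opts.out = args[0]
	opts.inputs = args[1:]
	return opts, 0
}

func main() {
	for _, args := range [][]string{
		{"--drop-tombstones", "OUT.sst", "newer.sst", "older.sst"},
		{"OUT.sst"}, // missing inputs: usage error
	} {
		opts, code := parseArgs(args)
		fmt.Println(code, opts.dropTombstones, opts.out, opts.inputs)
	}
}
```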

The CLI is intentionally minimal. There is no JSON, no stats, no progress. Stats live in db-22 (performance + benchmarking).

The cross-test scenario

The script in scripts/cross_test.sh:

Builds feed_newer.mt (memtable scenario from observation.md, 50 keys with key10 replaced and key5 deleted).
Builds feed_older.mt (100 keys with key50 = "OLD-50").
Promotes both to SSTables using the db-06 Rust binary (sstable build feed_newer.mt newer.sst).
For each language, runs compact OUT.sst newer.sst older.sst.
Asserts sha256(rust.OUT) == sha256(go.OUT) == sha256(cpp.OUT).
Runs the 3×3 read matrix using db-06's sstable iter over each OUT.
Spot-checks sstable get OUT.sst <key> for key5, key10, key50, key99, nope.

The spot-checks use db-06's sstable CLI (not db-07's compact), which is why steps 6–7 don't need a separate db-07 reader: the output is a db-06 SSTable.

Why this proves the merge

The scenario uses two SSTables with overlapping keys: key0..key49 appear in both, so the newer version must win, while key50..key99 exist only in the older input (with key50 carrying the distinctive value OLD-50). If your merge logic gets the recency tiebreaker wrong, you read val10 instead of NEW-10. If you forget to drain duplicates, you write the same key twice and SstWriter::add returns Error::Unsorted. If you drop tombstones by mistake, key5 disappears.

If all three languages get the same sha256, the algorithm and its translation to three runtimes are pinned down.
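The sha256 gate itself is small. The sketch below shows one way to express it in Go, assuming the three outputs have already been read into memory; output, digest, and firstMismatch are hypothetical names, and the real script compares files on disk.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// output pairs a language label with the bytes its compact binary produced.
type output struct {
	lang string
	data []byte
}

// digest returns the lowercase hex sha256 of data.
func digest(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// firstMismatch returns the label of the first output whose digest differs
// from outs[0], or "" if every output agrees.
func firstMismatch(outs []output) string {
	if len(outs) == 0 {
		return ""
	}
	want := digest(outs[0].data)
	for _, o := range outs[1:] {
		if digest(o.data) != want {
			return o.lang
		}
	}
	return ""
}

func main() {
	outs := []output{
		{"rust", []byte("same bytes")},
		{"go", []byte("same bytes")},
		{"cpp", []byte("same bytes")},
	}
	if bad := firstMismatch(outs); bad != "" {
		fmt.Println("mismatch in", bad)
		return
	}
	fmt.Println("match:", digest(outs[0].data))
}
```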

db-08 — Block Cache and Iterators

What is it?

Two small, foundational read-path components that every LSM engine (and most B-tree engines) needs:

Block cache — a bounded, in-memory map from (file_id, block_offset) to the decoded block bytes, evicting the least-recently-used entry when full. Sits between the SSTable reader and the OS page cache so that a hot index block or hot data block does not have to be decoded on every query.
Merging iterator — a streaming K-way merge over N pre-sorted sources (memtable, level-0 SSTables, level-1 SSTables, …) that yields each key exactly once, preferring the newest source on ties, and optionally drops tombstones. This is the engine of every LSM read: point lookups, range scans, compaction, and snapshot iteration.

Why does it matter?

In an LSM, a single user get("k") may have to consult the memtable plus 1–10 SSTables. Without a cache, every miss re-reads (and re-checksums, and re-decodes) blocks from disk; without a merging iterator, range scans cannot present a single ordered view of the live keyspace. Together these two components turn the LSM's "many small sorted runs" representation into the illusion of "one big sorted map" — and they do it without unbounded memory.

These primitives also appear far outside databases:

OS page cache is a block cache for files.
CPU L1/L2/L3 are hardware block caches keyed on physical address.
sort -m and most stream-join operators are merging iterators.
Bigtable reads run against a merged view of the memtable and a sequence of SSTables, which is the same tournament-style merge across sorted inputs.

How does it work?

            ┌─────────────── BlockCache (cap = N entries) ───────────────┐
get(k)  ──► │  HashMap<(file_id, off), Node*>  +  DoublyLinkedList<Node> │
            │  hit:  splice node to front, return value                  │
            │  miss: insert at front; if full, drop the back node        │
            └────────────────────────────────────────────────────────────┘
                              │
                              ▼
            ┌─────────────── MergingIterator(sources) ───────────────────┐
            │  min-heap of (current_key, src_idx)                        │
            │  Next():                                                   │
            │    pop heap → winner                                       │
            │    advance winner src, push its next key (if any)          │
            │    while heap.top().key == winner.key:                     │
            │       pop & advance older (they are shadowed by winner)    │
            │    if drop_tombstones and winner is tombstone: continue    │
            │    yield (winner.key, winner.entry)                        │
            └────────────────────────────────────────────────────────────┘

Two invariants make this correct:

Per-source sort. Within one source, keys are strictly ascending. The heap therefore needs only the front of each source — never the full set.
Tie-break by source index. Source 0 is newest; on a tie, the newest entry wins and the older copies are drained without being yielded. This is how a put in the memtable shadows an old value in L1, and how a tombstone shadows a value of the same key in any older source.

Terminology

Block — a fixed-ish-size chunk of an SSTable (typically 4 KiB) that is the unit of disk I/O and the unit of block-cache eviction.
Cache hit / miss — was the requested key present in the cache?
Eviction — removing an entry to make room. LRU picks the entry least-recently touched (read or written).
MRU / LRU — most/least recently used end of the list.
K-way merge — merging K already-sorted sequences into one sorted sequence. Optimal comparison cost is O(N log K) for N total entries.
Tournament tree / min-heap — the data structure used to pick the next source to advance in O(log K).
Tombstone — a marker that says "this key has been deleted"; it shadows any older value for the same key until it is dropped during compaction.
Newest-wins — the LSM tie-break rule: source i < j means i is newer.

Mental models

The cache is a bounded hash map with a freshness order. The hash gives you O(1) lookup, the list gives you O(1) eviction of the stalest entry.
The merging iterator is a tournament. K runners, each in their own lane; the heap is the leaderboard; every Next() advances the current leader by one step and re-runs the comparison between the new front of that lane and the rest of the heap.
Tombstones are entries, not absences. They occupy a slot in the stream and only disappear during a compaction that is guaranteed to have seen all older versions of the same key.
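The first mental model, a bounded hash map with a freshness order, can be sketched with container/list. This is a minimal illustration and not the lab's lru.go: blockKey, node, and blockCache are hypothetical names, capacity is counted in entries, and the cache is not safe for concurrent use.

```go
package main

import (
	"container/list"
	"fmt"
)

type blockKey struct {
	fileID uint64
	offset uint64
}

type node struct {
	key   blockKey
	block []byte
}

// blockCache is an entry-bounded LRU: the map finds a node in O(1),
// the list keeps recency order (front = MRU, back = LRU).
type blockCache struct {
	cap                     int
	order                   *list.List
	index                   map[blockKey]*list.Element
	hits, misses, evictions int
}

func newBlockCache(capacity int) *blockCache {
	return &blockCache{cap: capacity, order: list.New(), index: make(map[blockKey]*list.Element)}
}

// Get returns the cached block and promotes it to MRU on a hit.
func (c *blockCache) Get(k blockKey) ([]byte, bool) {
	el, ok := c.index[k]
	if !ok {
		c.misses++
		return nil, false
	}
	c.hits++
	c.order.MoveToFront(el)
	return el.Value.(*node).block, true
}

// Put inserts or refreshes a block, evicting the LRU entry when full.
func (c *blockCache) Put(k blockKey, block []byte) {
	if c.cap <= 0 {
		return // a zero-capacity cache stores nothing
	}
	if el, ok := c.index[k]; ok {
		el.Value.(*node).block = block
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() >= c.cap {
		back := c.order.Back()
		delete(c.index, back.Value.(*node).key)
		c.order.Remove(back)
		c.evictions++
	}
	c.index[k] = c.order.PushFront(&node{key: k, block: block})
}

func main() {
	c := newBlockCache(3)
	for i := uint64(1); i <= 3; i++ {
		c.Put(blockKey{fileID: 1, offset: i * 4096}, []byte{byte(i)})
	}
	c.Get(blockKey{fileID: 1, offset: 4096})          // hit: block 1 becomes MRU
	c.Put(blockKey{fileID: 1, offset: 16384}, nil)    // evicts block 2, now the LRU
	_, ok := c.Get(blockKey{fileID: 1, offset: 8192}) // miss
	fmt.Println(ok, c.hits, c.misses, c.evictions)
	// prints false 1 1 1
}
```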

Common misconceptions

"LRU is just a list." No — a list alone is O(N) per lookup. The hash map is what makes both operations O(1); the list only encodes the order.
"A merging iterator deduplicates by buffering everything." No. It inspects only the front of each source. Total memory is O(K), regardless of how many entries flow through.
"Newest-wins requires timestamps." Not in an LSM: source ordering already encodes recency (memtable > L0 > L1 > …). Timestamps are a separate concern for MVCC (db-13).
"A block cache replaces the OS page cache." It complements it. The OS caches raw file bytes; the block cache caches decoded/decompressed blocks and shortcuts the verification step (CRC checks, decompression).
"Tombstones can be dropped any time." Only during a compaction that includes the bottom level — otherwise an older live value could re-surface. See db-07 for the rule; db-08 lets the caller decide via a flag.

Talking points (interview-grade)

Why bound the cache by entries vs bytes? Entry-bounded is simpler and fine when blocks are roughly uniform (e.g., RocksDB's default 4 KiB blocks). Production systems bound by bytes (RocksDB's NewLRUCache, for example, takes a capacity in bytes) because compressed block sizes vary widely; we use entry count here to keep the data structure the focus of the lab.
Why a doubly-linked list and not a VecDeque? O(1) removal of an arbitrary node on hit-promotion. VecDeque only gives O(1) at the ends.
Why heap of (key, src) and not heap of full entries? Comparator cost: keys are small and comparable; entries (which may hold large values) are not. Also lets us move the entry out of the source vector with a single std::move / mem::take, avoiding copies.
Why does newest-wins also drain all older entries with that key? Otherwise the iterator would emit duplicates downstream, breaking the "exactly one entry per live key" contract that compaction and range scans depend on.
What about thread safety? Our block cache is single-writer-single-reader by design. Real systems use sharded caches (RocksDB: 64 shards) so each shard has its own mutex and contention is 1/Nth.
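The sharding idea from the last talking point can be sketched as follows. Everything here is an assumption for illustration: the shard count of 64, the FNV-1a hash, the string key, and the plain map standing in for a per-shard LRU. The point is only that each key maps to exactly one shard and takes only that shard's mutex.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

const numShards = 64 // assumption for this sketch

// shard holds one mutex and one map; a real cache would hold an LRU here.
type shard struct {
	mu     sync.Mutex
	blocks map[string][]byte
}

type shardedCache struct {
	shards [numShards]shard
}

func newShardedCache() *shardedCache {
	c := &shardedCache{}
	for i := range c.shards {
		c.shards[i].blocks = make(map[string][]byte)
	}
	return c
}

// shardIndex picks the shard for a key by hashing it.
func shardIndex(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key)) // hash.Hash writes never return an error
	return int(h.Sum32() % numShards)
}

func (c *shardedCache) Put(key string, block []byte) {
	s := &c.shards[shardIndex(key)]
	s.mu.Lock()
	defer s.mu.Unlock()
	s.blocks[key] = block
}

func (c *shardedCache) Get(key string) ([]byte, bool) {
	s := &c.shards[shardIndex(key)]
	s.mu.Lock()
	defer s.mu.Unlock()
	b, ok := s.blocks[key]
	return b, ok
}

func main() {
	c := newShardedCache()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			c.Put(fmt.Sprintf("file%d:0", i), []byte{byte(i)})
		}(i)
	}
	wg.Wait()
	_, ok := c.Get("file3:0")
	fmt.Println("file3:0 cached:", ok)
}
```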

Connections to the rest of the curriculum

db-05 (memtable) is the newest source in every read-path merge.
db-06 (SSTable format) produces the sorted entries the merger consumes, and the blocks the cache caches.
db-07 (compaction) is itself a merging iterator, with drop_tombstones set by the caller, whose output is fed to an SSTable writer. db-08 generalizes that machinery so the read path can use it for point lookups and scans as well.
db-09 (LevelDB-complete) wires this into a full Get/Scan path.
db-13 (MVCC) layers per-key snapshot filtering on top of a merging iterator like this one.

db-08 — References

Block cache / LRU

O'Neil, O'Neil, Weikum — "The LRU-K Page Replacement Algorithm For Database Disk Buffering" (SIGMOD 1993). The canonical "LRU is not the whole story" paper; explains why LRU under-performs LRU-K on database workloads.
LevelDB util/cache.cc — the reference LRU used by LevelDB, wrapped in a small fixed number of shards (ShardedLRUCache). Doubly-linked list + hash table; reads update recency on hit. Worth reading end-to-end; a few hundred lines. https://github.com/google/leveldb/blob/main/util/cache.cc
RocksDB cache/lru_cache.{h,cc} and cache/clock_cache.{h,cc} — production-grade sharded LRU plus a clock-based variant. Demonstrates the shard-by-key-hash technique. https://github.com/facebook/rocksdb/tree/main/cache
CockroachDB Pebble internal/cache/ — a Go implementation with a modern API; useful for comparing language ergonomics. https://github.com/cockroachdb/pebble/tree/master/internal/cache
Postgres src/backend/storage/buffer/ — the canonical relational buffer pool: clock-sweep replacement with usage counts. Different policy, same role.

Merging iterators

Knuth, TAOCP Vol. 3 §5.4.1 — "Multiway Merging and Replacement Selection". The original analysis of K-way merge using a tournament tree and a loser tree.
LevelDB table/merger.cc and include/leveldb/iterator.h — the canonical read-path merging iterator interface, plus the merger that combines memtable + level-0 + level-N+ iterators. LevelDB's merger selects the smallest child with a linear scan rather than a heap, which is adequate for small K. https://github.com/google/leveldb/blob/main/table/merger.cc
RocksDB table/merging_iterator.{h,cc} — extended with range tombstones and pinned iterators. Shows how the interface evolves under production pressure.
Pebble level_iter.go + merging_iter.go (repository root) — a Go flavor with explicit handling of range deletes.

Background reading

Designing Data-Intensive Applications, Ch. 3 ("Storage and Retrieval"), pp. 70-89. Kleppmann's tour of LSM read amplification, bloom filters, and the role of the block cache.
Petrov, "Database Internals", Ch. 7 ("Log-Structured Storage"). Covers caching, iterators, and tombstone semantics at the level we implement.

Lab-specific notes

The canonical byte layout used by merge_iter is documented in src/rust/src/lib.rs on SerializeStream. It is deliberately minimal — its only job is to give us a byte-identical cross-language fingerprint for the sha256 check.
The cross-test reuses the same newer.mt/older.mt feedstock as db-07 so the two labs can be compared side-by-side. Their sha256s will differ because db-07 emits a full SSTable (with index, footer, padding) while db-08 emits a flat entry-stream, but the underlying ordering is identical.

db-08 — Analysis

Problem statement

We need two read-path primitives that the rest of the LSM stack assumes exist:

A bounded in-memory cache that lets us amortize the cost of decoding SSTable blocks across many lookups, with predictable memory usage and O(1) operations.
A streaming K-way merging iterator that exposes N pre-sorted sources as a single sorted, deduplicated stream — newest-wins on tie — without buffering all entries in memory.

Both must be small, dependency-light, and byte-deterministic when serialized (so the cross-language cross-test can detect any divergence).

Constraints

Determinism. Given identical inputs, the merge stream's serialized bytes must be identical across Rust, Go, and C++. This is the cross-test's only gate.
Bounded memory. The cache must cap at a user-supplied entry count; the iterator must use O(K) working set regardless of the number of entries.
No backtracking. The iterator is streaming: it must work on inputs that arrive lazily.
Newest-wins is strict. Source index 0 always wins. There are no timestamps, generations, or sequence numbers — that complexity is deferred to db-13 (MVCC).

Decisions

Cache eviction policy: LRU. Simple, predictable, well-understood. Not the best policy for all workloads (LRU-K, ARC, and CLOCK-Pro all beat it on scan-heavy workloads), but the correct teaching baseline.
Cache capacity unit: entries. Production systems bound by bytes; we use entries to keep the data structure (rather than the accounting) the focus.
Heap element shape: (key, source_index). Small and cheap to compare. Pulling the full entry into the heap would inflate comparator cost and force copies.
No timestamps / sequence numbers. Newest-wins is by source index alone.
Tombstone drop is opt-in. Callers pass drop_tombstones=true only when they have proven (via compaction rules — see db-07) that no older source could resurrect the deleted key.

Trade-offs

Choice	Pros	Cons
LRU (vs LRU-K, ARC, CLOCK)	O(1) ops, simple to reason about	Prone to scan pollution: one big scan can flush hot entries
Doubly-linked list (vs VecDeque)	O(1) arbitrary removal	Heavier per-node memory (two pointers)
Heap of `(key, src)` (vs entry)	Cheap compares, no copies	Indirection back to source vector on every pop
Entry-bounded cap (vs byte)	Simple, no per-block sizing	Memory usage depends on block-size distribution
Drain-on-tie eagerly	Caller never sees duplicates	Slight extra work even when caller would dedupe

Risks

Heap ordering bug on tie. If the (key, src) comparator forgets to break ties on src ascending, the merger silently emits the older value on key collisions. The "newest-wins" test catches this on a 2-entry input.
Cache eviction at boundary. Inserting into a full cache and then immediately calling Get on the just-evicted key must miss, not hit.
Iterator reentrancy. Calling Next after end-of-stream must keep returning end-of-stream, not panic.
Cross-language drift on serialization. Endianness or length-prefix width mismatches would invalidate the sha256. We pin to "u32 LE length + bytes + u8 type [+ u32 LE val_len + val]".
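The pinned frame can be sketched as a pair of append helpers. The tombstone tag of 1 matches the 040000006b65793501 framing quoted elsewhere in this lab; the value tag of 0 is an assumption, as are the names appendU32, appendValue, and appendTombstone.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

const (
	typeValue     byte = 0 // assumption: the text only shows the tombstone tag
	typeTombstone byte = 1 // matches the trailing 01 in 040000006b65793501
)

// appendU32 appends n as a 4-byte little-endian integer.
func appendU32(dst []byte, n uint32) []byte {
	var buf [4]byte
	binary.LittleEndian.PutUint32(buf[:], n)
	return append(dst, buf[:]...)
}

// appendValue appends: u32 LE key_len + key + u8 type + u32 LE val_len + val.
func appendValue(dst, key, val []byte) []byte {
	dst = appendU32(dst, uint32(len(key)))
	dst = append(dst, key...)
	dst = append(dst, typeValue)
	dst = appendU32(dst, uint32(len(val)))
	return append(dst, val...)
}

// appendTombstone appends: u32 LE key_len + key + u8 type, with no value part.
func appendTombstone(dst, key []byte) []byte {
	dst = appendU32(dst, uint32(len(key)))
	dst = append(dst, key...)
	return append(dst, typeTombstone)
}

func main() {
	fmt.Println(hex.EncodeToString(appendTombstone(nil, []byte("key5"))))
	// prints 040000006b65793501
	fmt.Println(hex.EncodeToString(appendValue(nil, []byte("key10"), []byte("NEW-10"))))
}
```

Writing the length with an explicit byte order, rather than casting a native integer, is what keeps the frame identical on every platform.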

Out of scope

Compression (RocksDB caches decompressed blocks; some configs cache both).
Pin/unpin handle protocol for zero-copy reads.
Snapshot/sequence-number-aware iteration (deferred to db-13).
Range deletes / range tombstones (deferred to db-21).
Block-cache statistics beyond hit/miss/evict.

db-08 — Execution

Build order

Rust first: drives the canonical data-shape decisions (the Entry enum from db-06, the byte format of SerializeStream).
Go second: ports the same algorithm with native data structures (container/list, container/heap).
C++ third: same algorithm with std::list + std::priority_queue.

After all three pass their own unit tests, we run scripts/cross_test.sh which builds canonical input SSTables via db-05 + db-06 and asserts that all three merge_iter binaries produce the same sha256.

Per-language layout

Rust (src/rust)

Cargo.toml pulls in db-06's sstable crate by path = "../../../db-06-sstable-format/src/rust".
src/lib.rs defines BlockCache, MergingIterator, SerializeStream, and re-exports sstable::Entry. The cache uses a HashMap of slot indices plus a Vec<Node> arena with embedded prev/next indices and a free-list — an arena-based intrusive list, which beats LinkedList<T> on allocator pressure.
src/bin/merge_iter.rs is the cross-test CLI.

Go (src/go)

go.mod is module github.com/10xdev/dse/db08 with replace directives pointing to db-05 and db-06 on disk.
lru.go uses container/list.List and map[BlockKey]*list.Element.
iter.go uses container/heap with a []heapItem backing slice.
cmd/merge_iter/main.go is the CLI.

C++ (src/cpp)

CMakeLists.txt directly compiles db-06's sstable.cc into our sstable_lib rather than add_subdirectorying db-06 — that would leak db-06's add_test registrations into our ctest.
lru.{h,cc} uses std::list<Node> + std::unordered_map; Get uses list_.splice(begin, list_, it->second) for O(1) MRU promotion.
iter.{h,cc} uses std::priority_queue<HeapEntry, std::vector, Greater>.
src/merge_iter_bin.cc is the CLI.

Verification

scripts/verify.sh builds + tests all three languages.
scripts/cross_test.sh builds db-05 + db-06 input pipelines, generates the same newer.sst / older.sst used by db-07, runs merge_iter in all three languages, and asserts sha256 byte-identity for both drop_tombstones=false and drop_tombstones=true. It also spot-checks that "NEW-10", "OLD-50", "val99" appear in the stream and that the key5 tombstone framing (040000006b65793501) appears exactly when expected.

Reproducible cross-test sha256 (this lab's truth)

drop=false  → f693c483ef39dfef8e6285e29f9051a57e60bf2c4ba7b45bbf552c7932687fd1 (1874 bytes)
drop=true   → ec71c56c89f451d33e58697af2d7bce985069078e1c599cc42062dfbba6e250e (1865 bytes)

The 9-byte difference is exactly the framing of one tombstone entry: u32_le(4) + "key5" + u8(1) = 4 + 4 + 1 = 9 bytes.

What you should be able to do after this lab

Sketch an LRU on a whiteboard in under three minutes and explain why both the hash map and the list are necessary.
Explain why a K-way merge uses a heap and not nested merges, and quote the O(N log K) comparison bound.
Identify, in any storage codebase, where the "newest-wins on tie" rule is enforced and where the "drain duplicates" step happens.
Argue when it is safe to drop a tombstone during iteration vs when it is not.

db-08 — Observation

What we measured (functional)

11 Rust unit tests pass (cargo test): 4 LRU + 6 merger + 1 serializer.
11 Go unit tests pass (go test ./...): 4 LRU + 6 merger + 1 serializer.
2 C++ ctest binaries pass (test_lru covers 4 cases, test_iter covers 7).
Cross-language sha256 match in both modes (see execution.md).

Anatomy of the output stream

For the canonical input (newer = bulk 50 + put key10=NEW-10 + del key5; older = bulk 100 + put key50=OLD-50) the entry count is 100 with drop=false and 99 with drop=true. Total byte-count for the output stream:

drop=false: 1874 bytes
drop=true : 1865 bytes (delta 9 = exactly one tombstone frame)

Each value entry costs 4 (key_len) + len(key) + 1 (type) + 4 (val_len) + len(val) bytes. For our scenario, most keys are keyN (4-5 bytes) with values valN (4-5 bytes), making the per-entry frame ~17-19 bytes.
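The two totals can be re-derived from the frame formula alone, without running any merge. A small sketch, with valueFrame, tombstoneFrame, and canonicalStreamSize as hypothetical helper names:

```go
package main

import "fmt"

// valueFrame is the serialized size of one value entry.
func valueFrame(key, val string) int { return 4 + len(key) + 1 + 4 + len(val) }

// tombstoneFrame is the serialized size of one tombstone entry.
func tombstoneFrame(key string) int { return 4 + len(key) + 1 }

// canonicalStreamSize sums the frames of the merged canonical stream:
// key0..key99, with key5 a tombstone, key10 = NEW-10, key50 = OLD-50.
func canonicalStreamSize(dropTombstones bool) int {
	total := 0
	for i := 0; i < 100; i++ {
		key := fmt.Sprintf("key%d", i)
		switch i {
		case 5:
			if !dropTombstones {
				total += tombstoneFrame(key)
			}
		case 10:
			total += valueFrame(key, "NEW-10")
		case 50:
			total += valueFrame(key, "OLD-50")
		default:
			total += valueFrame(key, fmt.Sprintf("val%d", i))
		}
	}
	return total
}

func main() {
	fmt.Println(canonicalStreamSize(false), canonicalStreamSize(true))
	// prints 1874 1865
}
```

The breakdown is 9 single-digit value entries at 17 bytes, 88 two-digit value entries at 19 bytes, two 20-byte entries for NEW-10 and OLD-50, and the 9-byte tombstone.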

Hit/miss behavior under repeated workloads

The lru_basic_hit_miss test demonstrates the basic counters: one Get on a present key bumps hits to 1; one Get on an absent key bumps misses to 1. The lru_evicts_lru_on_capacity test confirms that the eviction counter increments exactly once when a fourth insert into a 3-slot cache forces the LRU node out.

Tournament dynamics

With K = 2 sources in the cross-test, the heap has at most 2 entries; with K = 7 (memtable + L0 file + 5 L1 files in a realistic LSM), the heap has at most 7 entries regardless of the millions of entries flowing through. Heap operations are O(log K) per Next(), so even at K = 1000 the per-entry cost is ~10 comparisons.

Determinism

The serialize_is_deterministic_and_sized test in all three languages constructs the same (key, entry) stream twice and confirms identical serialized bytes. This is what the cross-test relies on — if any language becomes non-deterministic (e.g., picks the wrong duplicate on a tie, or serializes value lengths in big-endian), the sha256 mismatch will surface immediately.

What surprised me

The C++ std::priority_queue is a max-heap by default; you get a min-heap only if you pass an explicit std::greater-style comparator. Forgetting this gives a max-heap that emits keys in reverse order.
Rust's BinaryHeap is max-heap-by-default; we wrap in Reverse((key, src)) to flip it, which also automatically gives the correct tie-break on src ascending because Reverse(a) < Reverse(b) iff a > b and the derived tuple Ord compares lexicographically.
Go's container/heap requires you to write Less yourself, so the tie-break is explicit and self-documenting.
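The Rust point is easy to check in isolation. A minimal sketch (the helper name `pop_order` is hypothetical) that pushes (key, src) pairs into a `BinaryHeap` wrapped in `Reverse` and records the order in which they pop:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Pops every (key, src) pair from a min-heap built with `Reverse`,
/// returning the pairs in pop order.
fn pop_order(items: Vec<(Vec<u8>, usize)>) -> Vec<(Vec<u8>, usize)> {
    // `Reverse` flips the max-heap into a min-heap over the whole tuple,
    // so ordering is by key first, then by src ascending.
    let mut heap: BinaryHeap<Reverse<(Vec<u8>, usize)>> =
        items.into_iter().map(Reverse).collect();
    let mut out = Vec::new();
    while let Some(Reverse(pair)) = heap.pop() {
        out.push(pair);
    }
    out
}
```

For equal keys the pair with the smaller src pops first, which is the newest-wins tie-break the merger relies on.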

What did not surprise me

The hit/miss counts came out exactly as expected on first run for all three languages. The K-way merge produced a sorted stream on first run for Rust and Go.

db-08 — Verification

Cross-language byte identity (gating)

scripts/cross_test.sh is the gate. It builds canonical SSTable inputs and runs each language's merge_iter binary, comparing sha256 of the serialized merge stream.

Final results from the current run:

drop=false:
  rust: f693c483ef39dfef8e6285e29f9051a57e60bf2c4ba7b45bbf552c7932687fd1 (1874 bytes)
  go  : f693c483ef39dfef8e6285e29f9051a57e60bf2c4ba7b45bbf552c7932687fd1 (1874 bytes)
  cpp : f693c483ef39dfef8e6285e29f9051a57e60bf2c4ba7b45bbf552c7932687fd1 (1874 bytes)
  match: f693c483ef39dfef8e6285e29f9051a57e60bf2c4ba7b45bbf552c7932687fd1

drop=true:
  rust: ec71c56c89f451d33e58697af2d7bce985069078e1c599cc42062dfbba6e250e (1865 bytes)
  go  : ec71c56c89f451d33e58697af2d7bce985069078e1c599cc42062dfbba6e250e (1865 bytes)
  cpp : ec71c56c89f451d33e58697af2d7bce985069078e1c599cc42062dfbba6e250e (1865 bytes)
  match: ec71c56c89f451d33e58697af2d7bce985069078e1c599cc42062dfbba6e250e

The 9-byte size delta between modes equals exactly one tombstone frame (u32_le(4) + "key5" + u8(1)), confirming that the only entry dropped is the expected one.

Stream-content spot-checks

The cross-test runs xxd -p | grep to confirm that:

NEW-10 (hex 4e45572d3130) appears — the merged-write semantics worked.
OLD-50 (hex 4f4c442d3530) appears — keys present only in the older source survive.
val99 (hex 76616c3939) appears — the largest bulk key from older shows up.
040000006b65793501 (key5 tombstone framing) appears with drop=false and is absent with drop=true.

These are not redundant with the sha256 check: sha256 mismatch tells you something is wrong but not what; the framed-hex grep tells you which invariant broke.

Unit-test coverage matrix

Behavior	Rust	Go	C++
LRU basic hit/miss + counters	✅	✅	✅
LRU evicts LRU on capacity	✅	✅	✅
LRU re-insert overwrites + promotes	✅	✅	✅
LRU MRU-first key order after Get	✅	✅	✅
Merger: empty inputs → empty output	✅	✅	✅
Merger: single source passthrough	✅	✅	✅
Merger: two-source interleave (no duplicates)	✅	✅	✅
Merger: newest-wins on tie	✅	✅	✅
Merger: tombstone kept when drop=false	✅	✅	✅
Merger: tombstone dropped when drop=true	✅	✅	✅
SerializeStream deterministic & expected size	✅	✅	✅

How to re-verify locally

cd db-08-block-cache-and-iterators
bash scripts/verify.sh         # unit tests for all three languages
bash scripts/cross_test.sh     # cross-language byte-identity test

What would invalidate this proof

Changing SerializeStream's framing (lengths, endianness, type-byte encoding) — sha256 would diverge immediately.
Changing the (key, src) heap comparator to break ties on src descending — newest-wins test fails before cross-test runs.
Changing the cache capacity unit from entries to bytes — the LRU tests would need recalibration but no other lab depends on the unit choice.

db-08 — Broader Ideas

What this lab teaches that goes beyond storage

The two primitives in this lab — bounded caches and tournament merges — are load-bearing in every layer of computing, not just databases.

Bounded caches

CPU caches (L1/L2/L3) implement set-associative LRU/PLRU in hardware with the same hash-map-plus-recency-order shape, just expressed in gates.
TLBs are caches over the page tables' virtual-to-physical mapping; they share LRU's vulnerability to large scans.
HTTP caches (CDN edges, browser caches) cache responses keyed on URL with the same eviction problem and many of the same policies (LRU, LFU, TinyLFU, S3-FIFO).
Compiler caches (ccache, sccache, Bazel's CAS) cache build outputs keyed on the content hash of inputs — same data structure, different key.
JIT method caches in V8 and HotSpot cache compiled code; they too evict on capacity pressure.

The pattern is universal: bounded random-access store + recency or frequency order. Once you can implement and analyze LRU, you can swap in LFU, ARC, LRU-K, 2Q, CLOCK-Pro, TinyLFU, S3-FIFO, or W-TinyLFU by replacing the order without changing the index.

K-way merges

External sort (sort -m, MapReduce shuffle, Spark's sort-shuffle) is literally a K-way merge of sorted runs, identical in structure to ours.
Stream-stream joins (Flink, ksqlDB, Materialize) merge two ordered streams by key with a sliding-window predicate.
Time-series databases (Prometheus, InfluxDB, VictoriaMetrics) merge sorted chunks across files, then deduplicate by timestamp — newest-wins, with timestamp as the tie-breaker instead of source index.
Git's pack-objects merges sorted delta chains across pack files when serving a fetch.
Snapshot iteration in MVCC databases is a merging iterator with a per-key filter that drops versions newer than the snapshot's commit timestamp — exactly what db-13 will build on top of db-08.

"When does this break?"

LRU + scans. A long sequential scan pollutes the cache with entries the workload will never see again. Mitigations: scan-resistant policies (LRU-K, ARC), separate cache pools per access pattern, or O_DIRECT bypass.
K-way merge with very large K. When K approaches thousands (e.g., a Cassandra node with many SSTables on disk), O(log K) per-entry cost starts to bite. The fix is not a better merger but a compaction policy that keeps K bounded (db-07's job).
Tombstones outliving the keys they shadow. A delete-heavy workload produces tombstones faster than compaction can drop them; the merger spends increasing CPU skipping shadowed entries. Cassandra calls this "tombstone hell" and ships a tombstone_warn_threshold.
Cache stampede. Many threads simultaneously missing on the same key hammer the underlying storage; production systems add per-key locks ("singleflight" in Go's groupcache).

Extensions worth attempting

Sharded LRU. Replace the single cache with N independent shards keyed on hash(file_id, offset) % N.
TinyLFU admission filter in front of the cache (frequency sketch admits only entries seen more than once).
Block-cache statistics beyond hits/misses/evicts: per-entry size, bytes resident, age histogram, top-N hot blocks.
Bidirectional iterator. Add Prev() to support reverse range scans.
Range-tombstone aware merger. Adding range deletes changes heap-pop semantics: a range tomb shadows a range of point keys.
O(1) amortized doubly-linked list arena (Rust) that interns BlockKey to u32 indices to halve hash map memory.

Where this lab fits in the curriculum

After db-08, every later lab gets a free ride on these primitives:

db-09 wires BlockCache and MergingIterator into the LevelDB-complete Get/Scan paths.
db-13 (MVCC) layers snapshot-visibility filtering on top of a merging iterator just like this one.
db-14 (indexes / query optimization) builds secondary merging iterators for index-scan-then-fetch plans.
db-20 (distributed KV store) shards block caches across nodes and adds a network-aware admission policy.

Step 01 — LRU Block Cache

Goal

Implement a bounded, O(1) LRU cache keyed on (file_id, block_offset) holding decoded block bytes, with hit/miss/eviction statistics.

Spec

API (Rust signature; the Go and C++ APIs mirror it):

pub struct BlockKey { pub file_id: u64, pub offset: u64 }
pub struct CacheStats { pub hits: u64, pub misses: u64, pub evictions: u64 }

impl BlockCache {
    pub fn new(capacity: usize) -> Self;                       // capacity > 0
    pub fn get(&mut self, k: &BlockKey) -> Option<Vec<u8>>;    // promotes to MRU on hit
    pub fn insert(&mut self, k: BlockKey, v: Vec<u8>) -> bool; // returns true on overwrite
    pub fn len(&self) -> usize;
    pub fn capacity(&self) -> usize;
    pub fn stats(&self) -> CacheStats;
    pub fn keys_mru_to_lru(&self) -> Vec<BlockKey>;            // test-only
}

Behavior contracts:

get returns Some(v.clone()) on hit and moves that entry to MRU; bumps hits counter.
get returns None on miss; bumps misses counter.
insert on an existing key overwrites the value and promotes to MRU.
insert of a new key into a full cache evicts the LRU entry first; bumps evictions.
keys_mru_to_lru() returns the live keys in order; used by tests only.
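The contracts above can be exercised with a deliberately simple model. This is a sketch, not the lab's implementation: it tracks recency with a monotonically increasing tick in a `BTreeMap`, so promotion and eviction are O(log n) rather than the O(1) the goal asks for. The field names and the tick scheme are assumptions made for illustration.

```rust
use std::collections::{BTreeMap, HashMap};

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct BlockKey { pub file_id: u64, pub offset: u64 }

#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct CacheStats { pub hits: u64, pub misses: u64, pub evictions: u64 }

/// Sketch only: recency is a tick counter, so this is O(log n), not O(1).
pub struct BlockCache {
    capacity: usize,
    tick: u64,
    map: HashMap<BlockKey, (u64, Vec<u8>)>, // key -> (last-use tick, bytes)
    order: BTreeMap<u64, BlockKey>,         // last-use tick -> key, oldest first
    stats: CacheStats,
}

impl BlockCache {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be > 0");
        Self { capacity, tick: 0, map: HashMap::new(), order: BTreeMap::new(),
               stats: CacheStats::default() }
    }

    /// Moves `k` to the MRU position by giving it a fresh tick.
    fn touch(&mut self, k: BlockKey, old_tick: u64) -> u64 {
        self.order.remove(&old_tick);
        self.tick += 1;
        self.order.insert(self.tick, k);
        self.tick
    }

    pub fn get(&mut self, k: &BlockKey) -> Option<Vec<u8>> {
        let Some(&(old, _)) = self.map.get(k) else {
            self.stats.misses += 1;
            return None;
        };
        let new = self.touch(*k, old);
        let slot = self.map.get_mut(k).expect("present: checked above");
        slot.0 = new;
        self.stats.hits += 1;
        Some(slot.1.clone())
    }

    /// Returns true when an existing key was overwritten.
    pub fn insert(&mut self, k: BlockKey, v: Vec<u8>) -> bool {
        if let Some(&(old, _)) = self.map.get(&k) {
            let new = self.touch(k, old);
            self.map.insert(k, (new, v));
            return true;
        }
        if self.map.len() == self.capacity {
            // Evict the entry with the smallest tick (the LRU one).
            if let Some((_, victim)) = self.order.pop_first() {
                self.map.remove(&victim);
                self.stats.evictions += 1;
            }
        }
        self.tick += 1;
        self.order.insert(self.tick, k);
        self.map.insert(k, (self.tick, v));
        false
    }

    pub fn len(&self) -> usize { self.map.len() }
    pub fn capacity(&self) -> usize { self.capacity }
    pub fn stats(&self) -> CacheStats { self.stats }

    pub fn keys_mru_to_lru(&self) -> Vec<BlockKey> {
        self.order.values().rev().copied().collect()
    }
}
```

Replacing the `BTreeMap` with an intrusive doubly-linked list (or an index arena) is what brings promote and evict down to O(1); the hash index and the public API stay the same.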

Acceptance

cd src/rust && cargo test
cd src/go   && go test
cd src/cpp  && cmake -B build && cmake --build build && ctest --test-dir build

All four LRU tests pass in each language:

lru_basic_hit_miss
lru_evicts_lru_on_capacity
lru_reinsert_overwrites_and_promotes
lru_keys_order_mru_first

Discussion prompts

Why does get need a &mut self and not just &self? (Because it mutates the recency order, even though it only "reads" the cached value.)
What changes if you bound by total bytes instead of entries? (You need to weigh each entry on insert and evict in a loop until under cap.)
How would you make this thread-safe with minimum contention? (Sharded by hash(key) % N, one mutex per shard.)

Step 02 — Merging Iterator

Goal

Implement a streaming K-way merging iterator over K pre-sorted (key, Entry) sources holding N entries in total, where source index 0 is newest and ties are won by the smaller index. Support an optional drop_tombstones flag.

Spec

API (Rust signature):

pub struct MergingIterator { /* … */ }

impl MergingIterator {
    pub fn new(sources: Vec<Vec<(Vec<u8>, Entry)>>, drop_tombstones: bool) -> Self;
}

impl Iterator for MergingIterator {
    type Item = (Vec<u8>, Entry);
    fn next(&mut self) -> Option<Self::Item>;
}

Behavior contracts:

Each source is sorted strictly ascending by key with no within-source duplicates (caller's responsibility).
The merged stream is sorted strictly ascending; each key appears at most once.
On tie, source with smaller index wins; older sources are drained — i.e. their copies of that key are advanced past, not yielded.
If drop_tombstones=true, winning entries whose type is Tombstone are not yielded; the iterator continues to the next key.
The working set is O(K) regardless of N.
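One way to satisfy these contracts is a min-heap of (key, source index) pairs plus one buffered head entry per source. The sketch below is an illustration under assumptions: the `Entry` enum shape, the `heads` buffer, and the `advance` helper are hypothetical, and because `new` takes fully materialized vectors (as the signature above does), only the heap and head buffer are O(K).

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Entry { Value(Vec<u8>), Tombstone }

type Source = std::vec::IntoIter<(Vec<u8>, Entry)>;

pub struct MergingIterator {
    sources: Vec<Source>,
    // Min-heap of (key, source index): smallest key first, then newest source.
    heap: BinaryHeap<Reverse<(Vec<u8>, usize)>>,
    // Entry currently at the head of each source (its key lives in the heap).
    heads: Vec<Option<Entry>>,
    drop_tombstones: bool,
}

impl MergingIterator {
    pub fn new(sources: Vec<Vec<(Vec<u8>, Entry)>>, drop_tombstones: bool) -> Self {
        let n = sources.len();
        let mut it = Self {
            sources: sources.into_iter().map(Vec::into_iter).collect(),
            heap: BinaryHeap::new(),
            heads: vec![None; n],
            drop_tombstones,
        };
        for src in 0..n {
            it.advance(src);
        }
        it
    }

    /// Pulls the next entry of `src` (if any) into the heap.
    fn advance(&mut self, src: usize) {
        if let Some((key, entry)) = self.sources[src].next() {
            self.heads[src] = Some(entry);
            self.heap.push(Reverse((key, src)));
        }
    }
}

impl Iterator for MergingIterator {
    type Item = (Vec<u8>, Entry);

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            let Reverse((key, src)) = self.heap.pop()?;
            let entry = self.heads[src].take().expect("head set by advance");
            self.advance(src);
            // Drain older copies of the same key without yielding them.
            while let Some(Reverse((k, _))) = self.heap.peek() {
                if *k != key {
                    break;
                }
                let Reverse((_, older)) = self.heap.pop().expect("peeked");
                self.heads[older] = None;
                self.advance(older);
            }
            if self.drop_tombstones && entry == Entry::Tombstone {
                continue; // winner is a tombstone: skip to the next key
            }
            return Some((key, entry));
        }
    }
}
```

A production version would take streaming sources (for example SSTable block iterators) instead of vectors, so the whole merge stays O(K) in memory.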

Canonical serialization (cross-test contract)

For each yielded (key, entry):

u32_le(len(key)) || key                                          // 4+|key| bytes
u8(type)                                                          // 1 byte; 0=Value, 1=Tombstone
if type == Value:
    u32_le(len(value)) || value                                  // 4+|value| bytes

This is what SerializeStream emits and what the cross-test sha256s.
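The framing is small enough to sketch directly. The function names `serialize_stream` and `to_hex` and the `Entry` enum shape are assumptions for illustration; only the byte layout comes from the contract above.

```rust
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Entry { Value(Vec<u8>), Tombstone }

/// Appends one canonical frame per (key, entry) pair.
pub fn serialize_stream(entries: &[(Vec<u8>, Entry)]) -> Vec<u8> {
    let mut out = Vec::new();
    for (key, entry) in entries {
        out.extend_from_slice(&(key.len() as u32).to_le_bytes());
        out.extend_from_slice(key);
        match entry {
            Entry::Value(value) => {
                out.push(0); // type byte: Value
                out.extend_from_slice(&(value.len() as u32).to_le_bytes());
                out.extend_from_slice(value);
            }
            Entry::Tombstone => out.push(1), // type byte: Tombstone, no payload
        }
    }
    out
}

/// Lowercase hex, handy for comparing against `xxd -p` output.
pub fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}
```

Serializing a single key5 tombstone with this sketch yields 040000006b65793501, the same 9-byte frame the cross-test greps for.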

Acceptance

cd src/rust && cargo test
cd src/go   && go test
cd src/cpp  && ctest --test-dir build

Six (Rust/Go) or seven (C++) merger tests must pass:

empty inputs → empty output
single source passthrough
two-source interleave with no duplicates
newest-wins on tie
tombstone kept when drop=false
tombstone dropped when drop=true
(SerializeStream deterministic & expected size)

Discussion prompts

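Why not nested two-way merges? (A left-deep chain of two-way merges pushes each entry through up to K−1 comparisons, so total work is O(N · K) instead of O(N log K); for K=10 that's roughly 3× worse (10 vs. log2 10 ≈ 3.3), and the gap widens as K grows.)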
Why is "drain duplicates" eager rather than lazy? (Lazy would force the caller to dedupe, breaking the invariant that the merger's output is the single source of truth for "what's live at this key".)
Where in real systems do you find tie-break-by-source-index? (LSM read path, time-series chunk merging, Kafka log compaction, anywhere "newer wins" without explicit timestamps.)

Step 03 — CLI and Cross-Language Test

Goal

Wrap the MergingIterator in a CLI binary (merge_iter) that reads N SSTables (built by db-06), runs a merge, and writes the canonical serialized stream to stdout. Then prove the three language implementations are byte-identical with scripts/cross_test.sh.

CLI spec

merge_iter [--drop-tombstones] IN1.sst IN2.sst ...

IN1 is the newest source; INk is the oldest.
Reads each input via db-06's SstReader, converts to Vec<(Vec<u8>, Entry)>, feeds all into a MergingIterator, calls SerializeStream, and writes the bytes verbatim to stdout.
Exit code: 0 success, 1 input error, 2 usage error.
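The argument handling can be sketched without touching the file system. The names below (`Args`, `parse_args`, `exit_code_for`, the `EXIT_*` constants) are hypothetical, and two behaviors are assumptions rather than part of the spec: unknown `--` flags and an empty input list are both treated as usage errors.

```rust
#[derive(Debug, PartialEq, Eq)]
pub struct Args { pub drop_tombstones: bool, pub inputs: Vec<String> }

pub const EXIT_OK: i32 = 0;
pub const EXIT_INPUT_ERROR: i32 = 1; // reserved for unreadable or corrupt inputs
pub const EXIT_USAGE: i32 = 2;

/// Parses everything after argv[0]. Err carries the usage message.
pub fn parse_args<I: IntoIterator<Item = String>>(argv: I) -> Result<Args, String> {
    let usage = "usage: merge_iter [--drop-tombstones] IN1.sst IN2.sst ...";
    let mut args = Args { drop_tombstones: false, inputs: Vec::new() };
    for arg in argv {
        if arg == "--drop-tombstones" {
            args.drop_tombstones = true;
        } else if arg.starts_with("--") {
            // Assumption: unknown flags are usage errors.
            return Err(format!("unknown flag {arg}\n{usage}"));
        } else {
            // Inputs keep command-line order: first is newest.
            args.inputs.push(arg);
        }
    }
    if args.inputs.is_empty() {
        // Assumption: at least one input file is required.
        return Err(format!("no input files\n{usage}"));
    }
    Ok(args)
}

/// Maps a parse result to the documented exit code (before any file I/O).
pub fn exit_code_for(parsed: &Result<Args, String>) -> i32 {
    match parsed {
        Ok(_) => EXIT_OK,
        Err(_) => EXIT_USAGE,
    }
}
```

A real `main` would call `parse_args(std::env::args().skip(1))`, print the message to stderr on `Err`, and return `EXIT_INPUT_ERROR` if opening or decoding an SSTable fails.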

Acceptance

Run the cross-test:

bash scripts/cross_test.sh

It must:

Print match: lines with sha256s that are the same for all three languages (in both drop=false and drop=true modes).
Confirm via hex spot-check that NEW-10, OLD-50, and val99 are present in the stream.
Confirm the key5 tombstone framing (040000006b65793501) appears with drop=false and is absent with drop=true.
End with CROSS-TEST OK.

Captured truth (current run):

drop=false  → f693c483ef39dfef8e6285e29f9051a57e60bf2c4ba7b45bbf552c7932687fd1 (1874 bytes)
drop=true   → ec71c56c89f451d33e58697af2d7bce985069078e1c599cc42062dfbba6e250e (1865 bytes)

Discussion prompts

Why pipe the binary stream into sha256sum rather than diff the entry list? (A bytewise hash catches all serialization differences with a single number; it is the strongest possible equivalence test.)
The drop=true output is exactly 9 bytes shorter than drop=false. Where do those 9 bytes go? (u32_le(4) + "key5" + u8(1) = 4+4+1 = 9 — one tombstone frame.)
If you wanted to add a new entry kind (say, a "merge-add" delta), what would you change in the serialization? (Pick a new type byte (e.g. 2), decide its payload framing, document it, and update all three languages' SerializeStream and CLI in lockstep.)

db-09 — LevelDB Complete

What is it?

A small but end-to-end LSM-tree key-value store assembled from the labs we have built so far. It is the smallest interesting "real database" we can ship: opens a directory, durably accepts put/delete/batch writes, answers get/scan queries, and survives crashes — using the WAL (db-03), MemTable (db-05), SSTable format (db-06), and merging iterator (db-08) as its parts.

The engine deliberately stops short of automatic background compaction. That arrives in db-21; here we keep the focus on correctness of the read path across multiple immutable L0 SSTables and a live memtable.

Why does it matter?

This is the first lab where the labels start to look like the things you actually run in production:

Db::open(dir) + MANIFEST — every LSM-shaped store (LevelDB, RocksDB, Pebble, Cassandra's SSTable subsystem, HBase HFile) has exactly this contract: a directory is the database, a manifest enumerates which files are live, and recovery rebuilds in-memory state by reading the manifest and replaying the WAL.
The write path's three steps (encode batch → append+fsync WAL → apply to memtable) are the universal LSM commit. Almost every LSM engine does these three things in this order. Once you internalize why (durability before visibility), you can read any LSM source tree.
The read path — memtable then newest SSTable then older SSTables, with the first hit (Value or Tombstone) winning — is the core LSM invariant. Compaction in db-21 is just amortizing this work; it doesn't change the rule.

If you understand this lab, you understand the shape of LevelDB.

How does it work?

                 ┌────── Db (one directory) ─────────────────────┐
                 │                                                │
   write path    │  WriteBatch ─► encode ─► WAL.append + fsync   │
   ───────────►  │                            │                  │
                 │                            ▼                  │
                 │                      MemTable (in RAM)        │
                 │                            │                  │
                 │                  size/explicit Flush          │
                 │                            ▼                  │
                 │   sst-NNNNNN.sst.tmp ─► fsync ─► rename       │
                 │                            │                  │
                 │            prepend (id, SstReader) to L0      │
                 │            rewrite MANIFEST atomically        │
                 │            close+delete+reopen WAL            │
                 │                                                │
   read path     │  Get(k):  MemTable → L0[0] → L0[1] → …        │
   ───────────►  │           first hit wins; Tombstone ⇒ None    │
                 │                                                │
                 │  Scan:    MergingIterator over                 │
                 │             [MemTable, L0[0], L0[1], …]       │
                 │             drop_tombstones=true               │
                 └────────────────────────────────────────────────┘

On-disk layout

<dir>/
  MANIFEST            text; one "L0 <id>" line per live SSTable, newest first
  wal.log             db-03 WAL of WriteBatch records (binary)
  sst-000001.sst      db-06 SSTable, one per flush, zero-padded 6-digit id
  sst-000002.sst
  ...

Recovery (`Db::open`)

mkdir -p the directory.
Read MANIFEST line by line; each line is L0 <id> newest-first.
Open each SSTable in that order; track max_id.
Open the WAL with WalIter and replay every record (WriteBatch::decode then apply to a fresh memtable). Any torn tail is dropped by the WAL iterator (db-03 invariant).
Open the WAL for writes; set next_id = max_id + 1.
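The MANIFEST half of recovery (steps 2, 3, and 5) reduces to a small parser. A minimal sketch, assuming the text format described under "On-disk layout"; `parse_manifest` and `sst_file_name` are hypothetical names, and tolerating blank lines is an assumption.

```rust
/// Parses MANIFEST text: one "L0 <id>" line per live SSTable, newest first.
/// Returns (ids in file order, next_id = max_id + 1).
pub fn parse_manifest(text: &str) -> Result<(Vec<u64>, u64), String> {
    let mut ids = Vec::new();
    for (n, line) in text.lines().enumerate() {
        let line = line.trim();
        if line.is_empty() {
            continue; // assumption: blank lines are tolerated
        }
        let id = line
            .strip_prefix("L0 ")
            .and_then(|rest| rest.trim().parse::<u64>().ok())
            .ok_or_else(|| format!("MANIFEST line {}: expected \"L0 <id>\", got {line:?}", n + 1))?;
        ids.push(id);
    }
    // An empty MANIFEST yields next_id = 1, matching sst-000001.sst as the first flush.
    let next_id = ids.iter().copied().max().unwrap_or(0) + 1;
    Ok((ids, next_id))
}

/// File name for an SSTable id: zero-padded to six digits.
pub fn sst_file_name(id: u64) -> String {
    format!("sst-{id:06}.sst")
}
```

The returned id order is the read-path order: index 0 is the newest SSTable.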

Write path

put(k, v) ≡ Write(WriteBatch{Put{k,v}})
del(k)    ≡ Write(WriteBatch{Delete{k}})

Write(batch):
    bytes  = batch.encode()
    wal.append(bytes); wal.sync()    # durability first
    apply(batch, memtable)           # then visibility

The batch wire format is identical in-memory and on-WAL:

u32 LE count
  for each op:
    u8 type          # 0 = Put, 1 = Delete
    u32 LE klen
    key bytes
    if Put:
      u32 LE vlen
      value bytes

This is the same encoder/decoder Rust, Go, and C++ all use, which is what makes the cross-language byte-identity test possible.
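The wire format above translates into a short encoder and a strict decoder. This is a sketch under assumptions: the `Op` enum shape, the free functions `encode` and `decode`, and the string error type are illustrative, not the lab's actual `WriteBatch` API.

```rust
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Op {
    Put { key: Vec<u8>, value: Vec<u8> },
    Delete { key: Vec<u8> },
}

pub fn encode(ops: &[Op]) -> Vec<u8> {
    let mut out = (ops.len() as u32).to_le_bytes().to_vec();
    for op in ops {
        let (ty, key, value) = match op {
            Op::Put { key, value } => (0u8, key, Some(value)),
            Op::Delete { key } => (1u8, key, None),
        };
        out.push(ty);
        out.extend_from_slice(&(key.len() as u32).to_le_bytes());
        out.extend_from_slice(key);
        if let Some(value) = value {
            out.extend_from_slice(&(value.len() as u32).to_le_bytes());
            out.extend_from_slice(value);
        }
    }
    out
}

/// Splits `n` bytes off the front of `buf`, or fails if it is too short.
fn take<'a>(buf: &mut &'a [u8], n: usize) -> Result<&'a [u8], String> {
    let whole: &'a [u8] = *buf;
    if whole.len() < n {
        return Err("truncated batch".into());
    }
    let (head, tail) = whole.split_at(n);
    *buf = tail;
    Ok(head)
}

fn take_u32(buf: &mut &[u8]) -> Result<usize, String> {
    let b = take(buf, 4)?;
    Ok(u32::from_le_bytes([b[0], b[1], b[2], b[3]]) as usize)
}

pub fn decode(mut buf: &[u8]) -> Result<Vec<Op>, String> {
    let count = take_u32(&mut buf)?;
    let mut ops = Vec::new();
    for _ in 0..count {
        let ty = take(&mut buf, 1)?[0];
        let klen = take_u32(&mut buf)?;
        let key = take(&mut buf, klen)?.to_vec();
        let op = match ty {
            0 => {
                let vlen = take_u32(&mut buf)?;
                let value = take(&mut buf, vlen)?.to_vec();
                Op::Put { key, value }
            }
            1 => Op::Delete { key },
            other => return Err(format!("unknown op type {other}")),
        };
        ops.push(op);
    }
    if !buf.is_empty() {
        return Err("trailing bytes after batch".into()); // strict: reject leftovers
    }
    Ok(ops)
}
```

Rejecting trailing bytes keeps the encoding canonical: exactly one byte string decodes to a given batch, which the byte-identity test depends on.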

Flush

Flush():
    if memtable empty: return
    id = next_id++
    build SstWriter from memtable.sorted()
    write sst-id.sst.tmp; fsync; rename → sst-id.sst   # crash-safe publish
    prepend (id, SstReader) to ssts                   # newest first
    rewrite MANIFEST atomically (tmp + rename)
    wal.close(); remove(wal.log); wal = Wal::open(wal.log)
    memtable = MemTable::new()

The order matters: the SST must be durably renamed before we rewrite MANIFEST, and MANIFEST must be durably renamed before we truncate the WAL. If we crash between any two steps, recovery is safe — either the WAL still has the records, or the SST is on disk and listed in MANIFEST.
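The tmp + fsync + rename publish used for both the SST and the MANIFEST can be sketched with the standard library alone. The helper name `publish_atomically` is hypothetical, and the directory fsync is written for Unix-like systems (opening a directory as a `File` is platform-specific).

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Writes `bytes` to `dir/name` so that a reader sees either the previous
/// state (old file or no file) or the complete new file, never a partial one.
pub fn publish_atomically(dir: &Path, name: &str, bytes: &[u8]) -> io::Result<()> {
    let tmp = dir.join(format!("{name}.tmp"));
    let dst = dir.join(name);

    let mut f = File::create(&tmp)?;
    f.write_all(bytes)?;
    f.sync_all()?; // file contents durable before the rename
    drop(f);

    fs::rename(&tmp, &dst)?; // swap the name into place in one step

    // Make the rename itself durable by syncing the parent directory.
    #[cfg(unix)]
    {
        File::open(dir)?.sync_all()?;
    }
    Ok(())
}
```

In Flush terms, this helper would run once for sst-id.sst and once for MANIFEST, in that order, and only then would the WAL be reset.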

Read path

Get(k) walks the in-RAM memtable first, then SSTables newest-first. The first hit wins:

Source hit returns	Result
`Value(v)`	`Some(v)`
`Tombstone`	`None`
miss	continue

If nothing matches, return None.
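The first-hit-wins rule fits in one loop. In this sketch a `BTreeMap` stands in for both the memtable and each SSTable reader; the `Source` alias and the `Entry` enum shape are assumptions for illustration.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Entry { Value(Vec<u8>), Tombstone }

/// One sorted source: stands in for the memtable or a single SSTable.
pub type Source = BTreeMap<Vec<u8>, Entry>;

/// `sources[0]` is the memtable, followed by SSTables newest-first.
pub fn get(sources: &[Source], key: &[u8]) -> Option<Vec<u8>> {
    for source in sources {
        match source.get(key) {
            Some(Entry::Value(v)) => return Some(v.clone()), // newest value wins
            Some(Entry::Tombstone) => return None,           // delete shadows older values
            None => continue,                                // miss: try the next source
        }
    }
    None
}
```

The early return on `Tombstone` is the important line: without it, a deleted key would resurface from an older SSTable.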

ScanAll() and SerializeView() reuse db-08's MergingIterator. The memtable is materialized into a KeyEntry vector (already sorted by BTreeMap or std::map), then the iterator merges it with each SSTable's entries, preferring the newer source on ties (memtable beats L0[0] beats L0[1] ...).

What's intentionally out of scope

Compaction — db-21. Without it, repeated overwrites of the same key accumulate as more L0 SSTables. Reads stay correct (newest wins) but per-Get work grows linearly in flush count.
Snapshots / MVCC — db-13.
Sharding, replication, sequence numbers — db-16+.
Bloom filters per SSTable — built in db-04; wiring is a db-21 optimization (skip SSTables whose Bloom rejects the key).

Cross-language invariant

All three implementations expose a dbctl --dir DIR CLI that reads a script from stdin (PUT k v, DEL k, FLUSH, DUMP, DUMP_WITH_TOMBS). scripts/cross_test.sh drives the same script through each, performs a crash/recover cycle by closing and reopening the database, then compares sha256(DUMP) and sha256(DUMP_WITH_TOMBS) across Rust, Go, and C++.

A byte-identical DUMP after recovery proves that all three implementations agree on: the WAL record format, the SSTable format, the MANIFEST format, the merge ordering, the tombstone semantics, and the recovery procedure.

db-09 — References

Primary sources

Sanjay Ghemawat & Jeff Dean, LevelDB design document, Google, 2011. https://github.com/google/leveldb/blob/main/doc/index.md
LevelDB implementation notes, doc/impl.md — describes the on-disk layout (MANIFEST, log files, table files) and the recovery procedure that db-09 mirrors almost verbatim. https://github.com/google/leveldb/blob/main/doc/impl.md
Patrick O'Neil et al., The Log-Structured Merge-Tree (LSM-Tree), Acta Informatica 33, 1996. The original paper; introduces the C0/C1 levels, the merge step, and the amortized write-cost analysis.

LevelDB (Google) — direct ancestor of our design. https://github.com/google/leveldb
RocksDB (Facebook/Meta) — adds column families, additional compaction styles (universal, FIFO), and many tuning knobs but keeps the same WAL → memtable → flush → L0 → … shape. https://github.com/facebook/rocksdb/wiki
Pebble (CockroachDB) — RocksDB-compatible engine in Go; very readable. https://github.com/cockroachdb/pebble
HBase HFile / Cassandra SSTables — same on-disk philosophy.

Read-path correctness

Mark Callaghan, Read, write & space amplification, 2018 — explains why the "newest source wins" rule is required and how compaction trades read amplification for write amplification. https://smalldatum.blogspot.com/2018/09/read-write-and-space-amplification.html
Pebble's docs/rocksdb.md for an excellent diff-style walkthrough of how a modern engine differs from LevelDB while preserving the same correctness invariants.

Crash safety

Pillai, Chidambaram, et al., All File Systems Are Not Created Equal, OSDI '14. The "fsync the file, then fsync the directory" rule we follow for SST publish and MANIFEST rewrite comes from this work.

Cross-lab dependencies

db-03 (WAL) — record framing, torn-tail tolerance, WalIter.
db-05 (MemTable) — sorted map with explicit tombstones.
db-06 (SSTable) — on-disk sorted-string file format with a footer and trailing checksum.
db-08 (BlockCache + MergingIterator) — k-way merge with newer-source-wins and optional tombstone dropping; canonical SerializeStream used by Db::serialize_view.

db-09 — Analysis

We are stitching together db-03/05/06/08 into the smallest engine that deserves the name database. The hard part is not any single component — we already have all of them — but choosing the smallest set of design decisions that yields crash safety and cross-language byte-identity.

Required invariants

Durability of put — once put returns, a crash must not lose the write. Achieved by WAL append + fsync before applying to the memtable.
Atomic publish of an SSTable — a recovering process must see either the complete SST or none of it. Achieved by write(.tmp) → fsync → rename. (POSIX rename replaces the destination atomically; for the new name to survive a crash, the parent directory must be fsynced as well.)
Atomic publish of a flush — a recovering process must not see an SST that MANIFEST does not list, and must not see MANIFEST listing an SST that does not exist. Achieved by ordering: write SST → rename SST → rewrite MANIFEST → rename MANIFEST → truncate WAL. A crash between SST-publish and MANIFEST-rewrite leaks an unlisted SST file (harmless and reusable on the next flush via a higher id; we keep it simple and never reuse). A crash between MANIFEST-rewrite and WAL-truncate replays records that are already in the SST — MemTable::put is idempotent for the same key, so this is safe (the duplicate disappears on next flush).
Read precedence — for any key k, the answer must come from the most recent writer. Order: memtable first, then SSTables in newest-to-oldest order. Tombstones count as a hit.
Cross-language determinism — given the same input script, all three languages must produce byte-identical DUMP. Achieved by sharing exactly the formats defined in db-03/05/06/08 plus the WriteBatch wire format defined in this lab.

Design decisions

Why MANIFEST is plain text

LevelDB's MANIFEST is a binary record log of edits ("add file X to level Y", "delete file X", "set next file number to N", ...). That makes log replay fast but is not byte-identity-friendly across languages because each edit record carries varint-encoded fields and an internal version-edit format.

For this lab the live set is small (one process, no concurrent writers, no compaction) so we use the simpler representation: a text file rewritten on every flush, atomic by tmp+rename. The cost is one extra O(n) write per flush where n = number of live SSTables. For our small in-process loads, this is invisible. db-21 will replace it with the LevelDB-style edit log when compaction needs incremental atomic edits.

Why one SSTable per flush

LevelDB also writes one SST per flush; that's why they're "L0" files (level 0 is the only level where files may overlap). We keep the same property. "Newest L0 wins" then degenerates from a level-aware rule to a simple position-in-MANIFEST rule.

Why no compaction in db-09

Compaction is a separate concern: it's a background process whose only job is to reduce read amplification and reclaim space. Skipping it means:

Read cost grows linearly with flush count for keys that miss everything.
Disk usage grows monotonically — overwrites and deletes are never reclaimed.

Neither breaks correctness, and both are exactly what db-21 will fix. Splitting them keeps each lab small enough to fully verify.

Why the WriteBatch wire format is reused as the WAL record

Two formats are strictly worse than one: more surface area, more chances for a Rust/Go/C++ encoder to diverge. The batch encoder is the WAL serializer. The WAL framing (record-length + CRC32) is db-03's concern; the content of each record is a single encoded batch.

Why three languages

The same reason as every lab from db-01 onward: the only honest way to prove that two implementations of a binary protocol agree is to compute sha256 of their output and compare. With three independent implementations, it is very unlikely that independent bugs produce matching sha256s, so a match line is strong evidence that the encode + flush + recover + read pipeline is correct. It cannot rule out a mistake that all three implementations share, such as a common misreading of the spec.

db-09 — Execution

What we built, in the order we built it.

1. Rust (`src/rust`)

Cargo.toml declares crate leveldb09 (lib) and a binary dbctl.
It uses path dependencies on db-03-write-ahead-log, db-05-lsm-memtable, db-06-sstable-format, and db-08-block-cache-and-iterators. No network-fetched deps.
src/lib.rs defines Db, WriteBatch, Op, OpType, and re-uses the upstream types directly (wal::Wal, memtable::{MemTable, EntryType}, sstable::{SstReader, SstWriter}, blockcache::{MergingIterator, SerializeStream, EntriesFromReader}).
11 inline #[cfg(test)] tests covering: batch round-trip, batch trailing-byte rejection, memtable put/get, delete-shadows-value, flush+memtable cleared, flush+recovery, WAL replay, newest-SST-wins, scan dedupe and tombstone drop, deterministic serialize_view, recovery with both an SST and a non-empty WAL tail.
bin/dbctl.rs is a stdin-driven CLI used by the cross-language script.

2. Go (`src/go`)

go.mod module github.com/10xdev/dse/db09 with replace directives pointing at the sibling labs' Go modules.
db.go ports the Rust API one-for-one. The WriteBatch wire format is byte-for-byte identical (u32 LE count, then per op: type byte, u32 LE klen, key, optional u32 LE vlen + value).
db_test.go mirrors all 11 Rust tests.
cmd/dbctl/main.go is the matching CLI.

3. C++ (`src/cpp`)

CMakeLists.txt compiles upstream .cc files directly into local static libraries (wal_lib, memtable_lib, sstable_lib, blockcache_lib). We do not add_subdirectory(../../../db-NN) because that would leak the upstream lab's add_test calls into our ctest.
src/db.h and src/db.cc provide db09::Db, constructed via Db::Open(dir) -> std::unique_ptr<Db>. Db is non-copyable and non-movable (its dse::wal::Wal member is itself non-copyable, and exposing moves would require fiddly forwarding for little gain in a one-process toy).
WAL move-assignment (wal_ = dse::wal::Wal::Open(path)) is what makes the post-flush WAL reset work; this required confirming the upstream header declares Wal& operator=(Wal&&) noexcept.
src/dbctl.cc and tests/test_db09.cc mirror their Rust/Go siblings. The test file uses #undef NDEBUG before <cassert> to guarantee asserts fire under Release builds.

4. Scripts

scripts/verify.sh builds and runs each implementation's unit tests.
scripts/cross_test.sh:
1. Builds Rust/Go/C++ dbctl binaries.
2. Defines one canonical command script (run.script) covering multi-flush, overwrites that land in newer SSTables, tombstones, and a non-empty WAL tail.
3. For each language: pipes run.script into dbctl --dir db-LANG (writes + close), then reopens the same dir and pipes DUMP and DUMP_WITH_TOMBS into separate files. Reopen forces a real WAL replay and SST reload path.
4. Computes sha256 of DUMP and DUMP_WITH_TOMBS for each language and asserts all three match.
5. Spot-checks the Rust DUMP stream hex for the presence of the expected final key-value bytes (b=222, e=5) and the expected tombstone bytes (key a in DUMP_WITH_TOMBS).

What we deliberately didn't build

Compaction — db-21.
Block cache wiring inside Db — db-08 has the cache; db-09 doesn't need it because each SSTable reader already holds the file bytes in memory. We'll plug in the LRU during db-21 when SSTable I/O becomes cold.
Bloom-filter probing — db-04 has bloom; db-21 will skip SSTables whose Bloom rejects the key.

db-09 — Observation

What the cross-language verification actually proves.

Output of `scripts/cross_test.sh`

=== compare (DUMP, drop_tombstones=true) ===
  DUMP         rust=7d1568c7bfdad9635ff655f7c4162628aa3253a7b95505c3d418362eb4c4c09c (35 B)
  DUMP         go  =7d1568c7bfdad9635ff655f7c4162628aa3253a7b95505c3d418362eb4c4c09c (35 B)
  DUMP         cpp =7d1568c7bfdad9635ff655f7c4162628aa3253a7b95505c3d418362eb4c4c09c (35 B)
  match(DUMP): 7d1568c7bfdad9635ff655f7c4162628aa3253a7b95505c3d418362eb4c4c09c
=== compare (DUMP_WITH_TOMBS) ===
  DUMP_TOMBS   rust=27e3d256e73c3ddbd080ad7a92e5da0be780d65896644eb7d4ec0cc8a574709d (47 B)
  DUMP_TOMBS   go  =27e3d256e73c3ddbd080ad7a92e5da0be780d65896644eb7d4ec0cc8a574709d (47 B)
  DUMP_TOMBS   cpp =27e3d256e73c3ddbd080ad7a92e5da0be780d65896644eb7d4ec0cc8a574709d (47 B)
  match(DUMP_TOMBS): 27e3d256e73c3ddbd080ad7a92e5da0be780d65896644eb7d4ec0cc8a574709d
=== spot-check stream contents ===
  spot-checks ok
=== ALL OK ===

What the canonical script exercises

PUT a 1                 # → memtable
PUT b 2                 #
PUT c 3                 #
FLUSH                   # → sst-000001.sst (a=1, b=2, c=3)

PUT b 22                # overwrite, lands in next SST
DEL a                   # tombstone, lands in next SST
PUT d 4
FLUSH                   # → sst-000002.sst (a=Tomb, b=22, d=4)

PUT e 5                 # WAL only, never flushed
DEL c                   # WAL only
PUT b 222               # WAL only

Live set after replay = {b=222, d=4, e=5} (a deleted, c deleted). With tombstones = the live set plus tombstones for a and c.

Sizes

DUMP (drop_tombstones=true):  35 bytes
  b=222 :  4(klen) + 1 + 1(type) + 4(vlen) + 3 = 13
  d=4   :  4       + 1 + 1       + 4       + 1 =  11
  e=5   :  4       + 1 + 1       + 4       + 1 =  11
                                                  ---
                                                   35  ✓

DUMP_WITH_TOMBS:  47 bytes
  35 (as above)
  + tombstone a: 4 + 1 + 1 = 6
  + tombstone c: 4 + 1 + 1 = 6
                              ---
                               47  ✓

The arithmetic matches the canonical byte format and the observed file sizes, which means we are not only matching sha256s but matching them on the right content.
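The same arithmetic can be checked mechanically with a reference model: replay the canonical script against a single ordered map and serialize the result. This is a sketch, not the engine; `model_dump` and `CANONICAL_SCRIPT` are hypothetical names, and the script below omits the explanatory comments shown in the listing.

```rust
use std::collections::BTreeMap;

/// Replays a script against one ordered map (a reference model, not the
/// engine) and returns the serialized dump. `None` marks a tombstone.
pub fn model_dump(script: &str, with_tombstones: bool) -> Vec<u8> {
    let mut state: BTreeMap<String, Option<String>> = BTreeMap::new();
    for line in script.lines() {
        let words: Vec<&str> = line.split_whitespace().collect();
        match words.as_slice() {
            ["PUT", k, v] => {
                state.insert(k.to_string(), Some(v.to_string()));
            }
            ["DEL", k] => {
                state.insert(k.to_string(), None);
            }
            _ => {} // FLUSH and blank lines do not change the logical state
        }
    }
    let mut out = Vec::new();
    for (k, v) in &state {
        if v.is_none() && !with_tombstones {
            continue;
        }
        out.extend_from_slice(&(k.len() as u32).to_le_bytes());
        out.extend_from_slice(k.as_bytes());
        match v {
            Some(v) => {
                out.push(0); // Value
                out.extend_from_slice(&(v.len() as u32).to_le_bytes());
                out.extend_from_slice(v.as_bytes());
            }
            None => out.push(1), // Tombstone
        }
    }
    out
}

pub const CANONICAL_SCRIPT: &str = "PUT a 1\nPUT b 2\nPUT c 3\nFLUSH\n\
PUT b 22\nDEL a\nPUT d 4\nFLUSH\nPUT e 5\nDEL c\nPUT b 222\n";
```

The model produces 35 bytes without tombstones and 47 bytes with them, matching the observed sizes. Because FLUSH is a no-op in the model, it also illustrates that flushing must never change what a dump returns.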

What this proves

WriteBatch encoder agrees — otherwise WAL records would differ and recovery would diverge.
WAL framing + iterator agree — otherwise WAL replay would produce different memtables in the three languages.
MemTable ordering + tombstone semantics agree — otherwise the merge would produce different streams.
SSTable encoder agrees — otherwise SST files (and therefore the Entries() they yield) would differ.
Recovery procedure agrees — the dump is taken after close and reopen, so any drift in MANIFEST parsing, SST id assignment, or replay order would surface as a sha256 mismatch.
MergingIterator + SerializeStream agree — the same property db-08 verified, now exercised over a memtable+two-SST source set.

Any single bug in any of these six layers, in any one of the three languages, would break sha256 match. Matching is therefore very strong evidence of pipeline correctness end-to-end.

db-09 — Verification

How to reproduce the green status on a clean machine.

Prerequisites

macOS or Linux with Apple Clang / clang ≥ 14 / gcc ≥ 11.
cmake ≥ 3.20.
Rust toolchain ≥ 1.74 (rustup default stable).
Go ≥ 1.22.
shasum, xxd, awk (default on macOS; coreutils on Linux).

One command

cd db-09-leveldb-complete
scripts/verify.sh        # builds + unit tests in all three langs
scripts/cross_test.sh    # cross-language sha256 match

Both should print === OK === / === ALL OK === and exit 0.

Per-language drill-down

Rust

cd db-09-leveldb-complete/src/rust
cargo test --quiet
cargo build --release

Expected: 11 passed; 0 failed. The dbctl binary lands in target/release/dbctl.

Go

cd db-09-leveldb-complete/src/go
go test ./...
go build ./cmd/dbctl

Expected: ok github.com/10xdev/dse/db09 <duration>.

C++

cd db-09-leveldb-complete/src/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
ctest --test-dir build --output-on-failure

Expected: 100% tests passed, 0 tests failed out of 1 and the test_db09 binary prints OK.

What "green" means

A green run guarantees:

All 33+ unit tests pass (11 each in Rust, Go, C++).
The cross-language test produces byte-identical DUMP and DUMP_WITH_TOMBS after a close/reopen cycle.
Spot-checked hex bytes for b=222, e=5, and the tombstone for a are present in the stream — guarding against accidental empty-output regressions.

When verification fails

Cross-language sha256 mismatch — almost always a divergence in one of: WriteBatch wire format, MANIFEST line format, SST writer ordering, MergingIterator tie-break, or whether tombstones are dropped. The fix is almost never in db-09; it's in the upstream lab whose format drifted.
Recovery test fails in one language only — that language's WAL truncation step is wrong. Pattern (all three use it): close WAL → remove file → reopen WAL.
C++ ctest reports zero tests — you accidentally did add_subdirectory(../db-NN). Compile upstream .cc directly instead.
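The close → remove → reopen pattern for WAL truncation can be sketched with the standard library alone. This is an assumption-laden illustration: `open_wal` and `truncate_wal` are hypothetical helpers, and the lab's WAL type likely wraps the file in its own struct.

```rust
use std::fs::{self, File, OpenOptions};
use std::io;
use std::path::Path;

/// Open (or create) the WAL for appending.
fn open_wal(path: &Path) -> io::Result<File> {
    OpenOptions::new().create(true).append(true).open(path)
}

/// Truncate the WAL: close the old handle, remove the file, reopen it empty.
fn truncate_wal(path: &Path, wal: File) -> io::Result<File> {
    drop(wal); // close: no handle may keep writing to the old file
    fs::remove_file(path)?; // remove: the old records are gone
    open_wal(path) // reopen: a fresh, empty log
}
```

Taking the old `File` by value makes it impossible for the caller to keep appending through a stale handle after truncation.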

db-09 — Broader Ideas

Where to take this engine next, and where it already touches the rest of distributed-systems engineering.

Immediate next labs

db-10 — B-tree fundamentals. The "other half" of storage. LSMs optimize for write-heavy workloads with append-only files and amortized rewrites; B-trees optimize for in-place point updates and range scans. Production databases routinely combine ideas from both: Postgres pairs heap files and a WAL with B-tree indexes and relies on VACUUM to reap dead row versions; MySQL/InnoDB uses a clustered B-tree primary index plus an undo log.
db-21 — Storage-engine advanced. Compaction, leveled compaction policy, block cache wiring, bloom-filter probing, snapshots, file garbage collection. Everything that "real" LevelDB/RocksDB has that we postponed in db-09.

How this lab's pieces show up in distributed systems

MANIFEST as a tiny version-edit log is a microcosm of how distributed systems use a log to make state changes atomic. A Raft log is the same pattern at machine granularity: apply changes to a state machine only after they're durably appended to an append-only log.
rename for atomic publish is the local-filesystem analogue of two-phase commit. The OS gives us a strong primitive (rename is atomic under crash) and we lean on it. Distributed systems have to build equivalent primitives (Paxos / Raft / 2PC) because no underlying layer provides them for free.
Newest-source-wins under a total order on writes is exactly how a CRDT LWW-register, a multi-version concurrency control snapshot read, a Kafka log-compaction "last value wins" topic, and a Bigtable per-cell-with-timestamp work. The variable that changes between systems is what defines "newer" (file id here; sequence number in LevelDB proper; timestamp in Cassandra; vector clock in Dynamo).

Performance experiments worth running later

These are not required for the lab to be green; they are good Saturday afternoon explorations:

Plot read-amplification growth as L0 grows: write N batches, flushing after each and never compacting, then measure point-lookup latency vs N.
Replace text MANIFEST with LevelDB-style version-edit log; measure flush latency improvement at large live-set counts.
Add a block cache between SstReader::Get and the file bytes; measure hit rate on a Zipfian workload.
Wire bloom filters (db-04) per SSTable; measure how many SSTs you can skip for a typical miss-heavy workload.

What "production-ready" would require beyond this lab

Concurrent writers (a real Mutex on the write path, multiple readers via versioned snapshots).
Group commit (batch many WAL appends behind one fsync).
Direct I/O / pwrite-based SST writer to avoid double-buffering.
Checksums on every block read, not only at SST footer level.
A scheduler for background flush + compaction with admission control.
fsync(dir) after every file create / rename to survive metadata-loss scenarios on certain filesystems.

None of these change the shape of the engine — they make the same shape faster and tougher.

db-09 step 01 — The write path

Goal

Implement Db::open(dir), put(k,v), delete(k), and Write(WriteBatch) such that every successful return has been durably persisted to the WAL.

Tasks

Pick the on-disk layout (MANIFEST, wal.log, sst-NNNNNN.sst).
Define the WriteBatch wire format. Use a single encoder/decoder so the in-memory batch representation and the WAL record payload are identical bytes.
On open(dir):
- mkdir -p the directory.
- Read MANIFEST (if it exists) one line at a time; collect SST ids newest-first.
- Open each SSTable in order; track max_id.
- Replay wal.log with WalIter: decode each record as a batch and apply to a fresh memtable.
- Open the WAL for writes; set next_id = max_id + 1.
Implement Write(batch):
- Return early on an empty batch; it is a no-op, so don't write an empty WAL record.
- bytes = batch.encode(); wal.append(bytes); wal.sync();
- Apply the batch to the memtable.
put and delete are thin wrappers that build a one-op batch.
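The ordering in Write(batch) (encode, append, sync, and only then touch the memtable) can be sketched as follows. Everything here is an assumption made for illustration: the `Wal` trait, the `RecordingWal` test double, and the batch encoding (u8 tag, u32 LE klen, key, optional u32 LE vlen and value) are not the lab's actual types or wire format.

```rust
use std::collections::BTreeMap;

/// Hypothetical WAL interface for this sketch.
trait Wal {
    fn append(&mut self, record: &[u8]);
    fn sync(&mut self);
}

/// Test double that records the order of WAL calls instead of touching disk.
#[derive(Default)]
struct RecordingWal {
    events: Vec<&'static str>,
}

impl Wal for RecordingWal {
    fn append(&mut self, _record: &[u8]) {
        self.events.push("append");
    }
    fn sync(&mut self) {
        self.events.push("sync");
    }
}

enum Op {
    Put(Vec<u8>, Vec<u8>),
    Delete(Vec<u8>),
}

/// Illustrative encoding only: u8 tag | u32 LE klen | key | [u32 LE vlen | value].
fn encode(batch: &[Op]) -> Vec<u8> {
    let mut out = Vec::new();
    for op in batch {
        let (tag, key, value) = match op {
            Op::Put(k, v) => (1u8, k, Some(v)),
            Op::Delete(k) => (0u8, k, None),
        };
        out.push(tag);
        out.extend_from_slice(&(key.len() as u32).to_le_bytes());
        out.extend_from_slice(key);
        if let Some(v) = value {
            out.extend_from_slice(&(v.len() as u32).to_le_bytes());
            out.extend_from_slice(v);
        }
    }
    out
}

struct Db<W: Wal> {
    wal: W,
    memtable: BTreeMap<Vec<u8>, Option<Vec<u8>>>, // None = tombstone
}

impl<W: Wal> Db<W> {
    fn new(wal: W) -> Self {
        Db { wal, memtable: BTreeMap::new() }
    }

    fn write(&mut self, batch: &[Op]) {
        if batch.is_empty() {
            return; // no-op: never emit an empty WAL record
        }
        self.wal.append(&encode(batch));
        self.wal.sync(); // durable before the write becomes visible
        for op in batch {
            match op {
                Op::Put(k, v) => self.memtable.insert(k.clone(), Some(v.clone())),
                Op::Delete(k) => self.memtable.insert(k.clone(), None),
            };
        }
    }

    fn get(&self, key: &[u8]) -> Option<&[u8]> {
        self.memtable.get(key).and_then(|slot| slot.as_deref())
    }
}
```

In this sketch put and delete would each call `write` with a one-element slice, mirroring the thin-wrapper task above.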

Acceptance

Inline unit tests:

batch_roundtrip — encode → decode round-trip preserves three representative ops (Put, Delete, Put-with-empty-key).
batch_rejects_trailing — decoding rejects a payload corrupted by one extra trailing byte.
put_get_memtable — put("a","1") followed by get("a") returns Some("1"); get("missing") returns None.
delete_shadows — put then delete makes get return None.

All four green in Rust, Go, and C++.

Discussion prompts

Why must wal.sync() happen before applying to the memtable, not after?
What invariant would break if we let Write proceed for an empty batch?
How would a group-commit optimization preserve the same durability guarantee while batching multiple Write calls behind a single fsync?

db-09 step 02 — Flush and recovery

Goal

Implement Flush() and recovery such that crashes between any two file operations never produce an inconsistent live set.

Tasks

Implement Flush() as the strict sequence:
1. If memtable is empty, return.
2. Allocate id = next_id; next_id += 1.
3. Build an SstWriter from memtable.sorted(). For each entry, map EntryType::Value → Value (with bytes) and EntryType::Tombstone → Tombstone (empty value).
4. Write sst-<id>.sst.tmp durably (open + write + fsync).
5. rename it to sst-<id>.sst.
6. Prepend (id, SstReader) to the in-memory ssts list (newest first).
7. Rewrite MANIFEST atomically: write MANIFEST.tmp durably (one L0 <id> line per live SST, newest first), then rename to MANIFEST.
8. Close the WAL, remove wal.log, reopen the WAL.
9. Replace memtable with an empty one.

Verify the recovery sequence implemented in step 01 still satisfies the crash matrix:

Crash between …	Effect on next open
step 4 and 5	leftover `*.tmp` file, ignored on next open
step 5 and 7	leftover unlisted SST file, ignored on next open
step 7 and 8	replayed WAL re-applies writes that are also in the latest SST — idempotent because `MemTable::put` is overwrite
step 8 and 9	harmless; the WAL is already empty and step 9 touches memory only, so the next open serves the same data from the SSTs
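Steps 4, 5, and 7 share one pattern: write a temporary file durably, then rename it into place. A minimal sketch under stated assumptions: `publish_atomically` and `manifest_bytes` are hypothetical helpers, the exact `L0 <id>` line formatting (no zero padding, trailing newline) is a guess, and the best-effort directory sync reflects typical Unix behavior.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Write `bytes` to `<final_path>.tmp`, fsync it, then rename it over `final_path`.
fn publish_atomically(final_path: &Path, bytes: &[u8]) -> io::Result<()> {
    let mut tmp_name = final_path.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp_path = PathBuf::from(tmp_name);
    {
        let mut file = File::create(&tmp_path)?;
        file.write_all(bytes)?;
        file.sync_all()?; // contents are durable before the name is published
    }
    fs::rename(&tmp_path, final_path)?;
    // Best effort: sync the parent directory so the rename itself is durable.
    if let Some(parent) = final_path.parent() {
        if let Ok(dir) = File::open(parent) {
            let _ = dir.sync_all();
        }
    }
    Ok(())
}

/// One `L0 <id>` line per live SST, newest first (formatting is an assumption).
fn manifest_bytes(ids_newest_first: &[u64]) -> Vec<u8> {
    let mut out = String::new();
    for id in ids_newest_first {
        out.push_str(&format!("L0 {}\n", id));
    }
    out.into_bytes()
}
```

A crash before the rename leaves only a `*.tmp` file behind, which matches the first row of the matrix.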

Acceptance

Inline unit tests:

flush_creates_sst — after Flush(), memtable empty and LiveSstIds().len() == 1; reads still work.
flush_then_recover — Flush(), drop Db, reopen, reads still return the flushed values.
wal_replay — without flushing, drop Db, reopen; memtable has the pre-crash writes.
newest_sst_wins — two flushes with overlapping keys; the value from the newer flush is returned.
recovery_after_flush_plus_wal — mix: flush, then write more (tombstones + puts) without flushing, drop, reopen; reads reflect both the flushed and the WAL-only writes correctly.

All five green in Rust, Go, and C++.

Discussion prompts

Why prepend instead of append to the ssts list?
Why is it safe to truncate the WAL even when the new MANIFEST may not yet be fsync'd to its parent directory?
What would change if step 7 used an edit log (append a "+id" record) instead of rewriting the whole file?

db-09 step 03 — CLI and cross-language byte-identity

Goal

Build a dbctl --dir DIR CLI in all three languages that reads commands from stdin, then assert via sha256 that all three produce byte-identical output for the same canonical script — including after a crash/recover cycle.

CLI contract

Each line of stdin is one of:

# comment (ignored)
PUT  <key>  <value>      # whitespace-delimited (no spaces inside)
DEL  <key>
FLUSH
DUMP                     # write serialize_view(drop_tombstones=true) to stdout
DUMP_WITH_TOMBS          # write serialize_view(drop_tombstones=false) to stdout

Blank lines and lines starting with # are ignored.

DUMP and DUMP_WITH_TOMBS write raw bytes (no trailing newline) so that sha256 over stdout is a pure function of the database state.
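One way to parse a line of this contract, as a hedged sketch. The `Command` enum and `parse_line` are hypothetical; treating extra tokens and unknown verbs as errors, and matching verbs case-sensitively, are assumptions the contract does not spell out.

```rust
#[derive(Debug, PartialEq)]
enum Command {
    Put(String, String),
    Del(String),
    Flush,
    Dump,
    DumpWithTombs,
}

/// Ok(None) for blank lines and comments, Ok(Some(cmd)) for a command,
/// Err for anything else.
fn parse_line(line: &str) -> Result<Option<Command>, String> {
    let trimmed = line.trim();
    if trimmed.is_empty() || trimmed.starts_with('#') {
        return Ok(None);
    }
    let mut parts = trimmed.split_whitespace();
    let verb = parts.next().unwrap_or("");
    let args: Vec<&str> = parts.collect();
    match (verb, args.as_slice()) {
        ("PUT", [key, value]) => Ok(Some(Command::Put(key.to_string(), value.to_string()))),
        ("DEL", [key]) => Ok(Some(Command::Del(key.to_string()))),
        ("FLUSH", []) => Ok(Some(Command::Flush)),
        ("DUMP", []) => Ok(Some(Command::Dump)),
        ("DUMP_WITH_TOMBS", []) => Ok(Some(Command::DumpWithTombs)),
        _ => Err(format!("unrecognized command: {}", trimmed)),
    }
}
```

Keeping the parser a pure function of one line makes it easy to unit-test separately from the stdin loop and the database.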

Tasks

Build dbctl in Rust (src/rust/src/bin/dbctl.rs), Go (src/go/cmd/dbctl/main.go), and C++ (src/cpp/src/dbctl.cc).
Write scripts/cross_test.sh that:
1. Builds all three binaries.
2. Creates one canonical command script that exercises multi-flush, overwrites that land in newer SSTables, tombstones, and a non-empty WAL tail.
3. For each language: pipes the script into dbctl --dir db-LANG (which fully writes and closes), then reopens the directory and pipes DUMP (and separately DUMP_WITH_TOMBS) into a file.
4. Computes sha256 over each dump file; asserts all three match.
5. Spot-checks the rust DUMP stream hex for the expected post-recovery key-value bytes to guard against silent-empty regressions.
Write scripts/verify.sh that runs unit tests in all three languages.

Acceptance

$ scripts/verify.sh
=== rust === ... ok
=== go   === ... ok
=== cpp  === ... ok
=== OK ===

$ scripts/cross_test.sh
...
  match(DUMP):       7d1568c7...
  match(DUMP_TOMBS): 27e3d256...
  spot-checks ok
=== ALL OK ===

A byte-identical DUMP after reopen is strong evidence of correctness for the entire encode → flush → MANIFEST → recover → merge → serialize pipeline across three independent implementations.

Discussion prompts

Why force a close+reopen between the writes and the DUMP, instead of dumping from the same process?
Why is DUMP (without tombstones) not sufficient on its own as a sound proof? What does DUMP_WITH_TOMBS add?
If the three sha256s ever diverge, which lab's format is the most probable culprit, and why?

db-10 — B-Tree Fundamentals

The first lab of the B-tree track. Up to db-09 every persistent structure we built was an LSM (log + sorted runs + merge). Postgres, MySQL/InnoDB, SQLite, Oracle, SQL Server, and most embedded key-value engines you have never heard of are B-trees instead. This lab builds the in-memory kernel; db-11 wraps it in a pager so it can live on disk.

What is it?

A self-balancing search tree where every node holds up to 2T - 1 keys (and, if internal, up to 2T children). We pick the smallest non-trivial degree T = 2, giving:

1 ≤ keys ≤ 3 per non-root node
2 ≤ children ≤ 4 per non-root internal node
root may hold 1..3 keys (or 0 if the tree is empty)

The algorithms are the textbook CLRS B-tree: insert splits a child proactively while descending if it is full; delete rebalances a child proactively while descending if it would underflow. With this discipline every operation is exactly one root-to-leaf traversal — no second pass, no recursion back up to fix invariants.

Keys and values are arbitrary byte slices; comparison is lexicographic. Each node carries the value of every key it holds (this is a B-tree, not a B+-tree — values do not live exclusively in the leaves). db-11 will make the leaf-only choice when we introduce the pager and need to keep internal nodes small.

Why does it matter?

Predictable depth. Height is O(log_T n), so even with T=2 the number of nodes visited per lookup is tightly bounded, no matter the insertion order. LSMs accept read amplification that grows with the number of levels in exchange for cheap sequential writes; B-trees accept page rewrites in exchange for that tight bound.
In-place update. A B-tree key update mutates exactly one node. LSMs append a new record and reclaim the old one during compaction. Which is better depends on workload — db-22 will measure it.
The canonical study substrate. Every working storage engineer has implemented a B-tree at least once. Splits and merges are the microcosm of every concurrent, copy-on-write, or page-versioned variant that exists in production code.

How does it work?

Node layout

        ┌─────────────────────────── Node ────────────────────────────┐
        │  is_leaf : bool                                             │
        │  keys    : Vec<(key, value)>      // 1..3 entries           │
        │  children: Vec<Box<Node>>         // 0 if leaf, else nkeys+1│
        └─────────────────────────────────────────────────────────────┘

Internal node with two keys (k0 < k1):

        ┌──────────┬──────────┐
        │   k0,v0  │   k1,v1  │
        └─┬──────┬─┴────────┬─┘
          │      │          │
          ▼      ▼          ▼
          c0     c1         c2

c0: keys < k0      c1: k0 < keys < k1      c2: keys > k1
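The layout above translates directly into a lookup. A minimal sketch, assuming the `Node` fields shown in the diagram; `get` as a free function and the `leaf` helper are illustrative choices, not necessarily how the lab's BTree::get is written.

```rust
struct Node {
    is_leaf: bool,
    keys: Vec<(Vec<u8>, Vec<u8>)>, // sorted (key, value) pairs
    children: Vec<Box<Node>>,      // empty if leaf, else keys.len() + 1
}

/// Hypothetical helper: build a leaf from (key, value) string pairs.
fn leaf(entries: &[(&str, &str)]) -> Node {
    Node {
        is_leaf: true,
        keys: entries
            .iter()
            .map(|(k, v)| (k.as_bytes().to_vec(), v.as_bytes().to_vec()))
            .collect(),
        children: Vec::new(),
    }
}

/// Walk from `node` toward a leaf, returning the value stored for `key`.
fn get<'a>(mut node: &'a Node, key: &[u8]) -> Option<&'a [u8]> {
    loop {
        // Lexicographic byte comparison, as the lab specifies.
        match node.keys.binary_search_by(|(k, _)| k.as_slice().cmp(key)) {
            // Values live in every node, so a hit can occur above the leaves.
            Ok(i) => return Some(node.keys[i].1.as_slice()),
            Err(i) => {
                if node.is_leaf {
                    return None;
                }
                // `i` is the index of the child whose range covers `key`.
                node = &*node.children[i];
            }
        }
    }
}
```

Because this is a B-tree rather than a B+-tree, the search can stop at an internal node.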

Insert (proactive split)

Descend from the root. Before stepping into any full child (nkeys == 3), split that child in place: promote its middle key to the parent, insert the new right sibling into the parent's child list at position i+1, and continue the descent into whichever half (both are now non-full) covers the key. If the root itself is full, grow upwards first: create a new empty root with the old root as its only child, then split that child. This is the only place tree height increases.

Before split (child too full):     After split (middle promoted):

   [   K   ]                          [ K , k1 ]
        │                              │      │
        ▼                              ▼      ▼
  [k0, k1, k2]                       [k0]    [k2]
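The split in the diagram can be sketched as below. The function name split_child matches the helper listed for the Rust crate, but this signature and body are an illustrative reconstruction under the CLRS rules, not the lab's code; `leaf` is a hypothetical test helper.

```rust
const T: usize = 2;
const MAX_KEYS: usize = 2 * T - 1;

struct Node {
    is_leaf: bool,
    keys: Vec<(Vec<u8>, Vec<u8>)>,
    children: Vec<Box<Node>>,
}

/// Hypothetical helper: a leaf whose values equal its keys.
fn leaf(keys: &[&str]) -> Node {
    Node {
        is_leaf: true,
        keys: keys
            .iter()
            .map(|k| (k.as_bytes().to_vec(), k.as_bytes().to_vec()))
            .collect(),
        children: Vec::new(),
    }
}

/// Split the full child at index `i` of `parent`, promoting its median key.
fn split_child(parent: &mut Node, i: usize) {
    let child = &mut parent.children[i];
    debug_assert_eq!(child.keys.len(), MAX_KEYS);
    // Keys after the median move to the new right sibling.
    let right_keys = child.keys.split_off(T);
    // The median is now the last key left in the child.
    let median = child.keys.pop().expect("a full child has a median key");
    // An internal child also hands over its upper T children.
    let right_children = if child.is_leaf {
        Vec::new()
    } else {
        child.children.split_off(T)
    };
    let right = Box::new(Node {
        is_leaf: child.is_leaf,
        keys: right_keys,
        children: right_children,
    });
    parent.keys.insert(i, median);
    parent.children.insert(i + 1, right);
}
```

With T = 2 a full child [k0, k1, k2] leaves [k0] on the left, promotes k1, and puts [k2] on the right, as in the diagram.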

Delete (proactive rebalance)

Descend from the root looking for the key. Before stepping into any child that has only T - 1 = 1 key, ensure it has at least T = 2 keys by one of:

Borrow from left sibling — rotate left sibling's last key up into the parent, parent's separator down into the child's front.
Borrow from right sibling — symmetric.
Merge with a sibling — pull the parent's separator down, concatenate child + separator + sibling into a single node with 2T - 1 = 3 keys.

If the root becomes an empty internal node (only one child, no keys) after the operation, collapse it: the root's only child becomes the new root. This is the only place tree height decreases.

Deletes that hit an internal key are handled by replacing the key with its in-order successor (or predecessor) and recursing the delete into that subtree, where the recursion terminates at a leaf.
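The three rebalance cases can be folded into one helper for illustration. This sketch assumes `parent` is an internal node with at least one key; `ensure_min_keys` is a hypothetical name (the lab's crate splits this work across borrow_from_prev, borrow_from_next, and merge_children, whose exact signatures are not shown here).

```rust
const T: usize = 2;

struct Node {
    is_leaf: bool,
    keys: Vec<(Vec<u8>, Vec<u8>)>,
    children: Vec<Box<Node>>,
}

/// Hypothetical helper: a leaf whose values equal its keys.
fn leaf(keys: &[&str]) -> Node {
    Node {
        is_leaf: true,
        keys: keys.iter().map(|k| (k.as_bytes().to_vec(), k.as_bytes().to_vec())).collect(),
        children: Vec::new(),
    }
}

/// Make sure child `i` has at least T keys before descending into it.
/// Returns the index of the child to descend into (it shifts after a left merge).
fn ensure_min_keys(parent: &mut Node, i: usize) -> usize {
    if parent.children[i].keys.len() >= T {
        return i;
    }
    if i > 0 && parent.children[i - 1].keys.len() >= T {
        // Borrow from the left sibling: its last key goes up, the separator comes down.
        let (left_part, right_part) = parent.children.split_at_mut(i);
        let left = &mut left_part[i - 1];
        let child = &mut right_part[0];
        let up = left.keys.pop().expect("left sibling has a spare key");
        let down = std::mem::replace(&mut parent.keys[i - 1], up);
        child.keys.insert(0, down);
        if let Some(moved) = left.children.pop() {
            child.children.insert(0, moved);
        }
        return i;
    }
    if i + 1 < parent.children.len() && parent.children[i + 1].keys.len() >= T {
        // Borrow from the right sibling: the symmetric rotation.
        let (left_part, right_part) = parent.children.split_at_mut(i + 1);
        let child = &mut left_part[i];
        let right = &mut right_part[0];
        let up = right.keys.remove(0);
        let down = std::mem::replace(&mut parent.keys[i], up);
        child.keys.push(down);
        if !right.is_leaf {
            let moved = right.children.remove(0);
            child.children.push(moved);
        }
        return i;
    }
    // Merge: left + separator + right become one node with 2T - 1 keys.
    let left_idx = if i + 1 < parent.children.len() { i } else { i - 1 };
    let separator = parent.keys.remove(left_idx);
    let mut right = parent.children.remove(left_idx + 1);
    let left = &mut parent.children[left_idx];
    left.keys.push(separator);
    left.keys.append(&mut right.keys);
    left.children.append(&mut right.children);
    left_idx
}
```

If the merge empties an internal root, the caller would then collapse it as described above; that step is left out of this sketch.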

Canonical serialization

A preorder traversal of the tree emitting, per node:

u8     is_leaf            (1 if leaf, 0 if internal)
u32 LE nkeys
nkeys * { u32 LE klen | klen bytes key | u32 LE vlen | vlen bytes val }
if !is_leaf:
   (nkeys + 1) * recurse(child)

The empty tree therefore serializes as five bytes: 01 00 00 00 00 (one leaf node with zero keys).
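A minimal sketch of the preorder encoder for the format above. The `Node` fields follow the layout shown earlier; the free function `serialize` is an illustrative stand-in for the lab's serialize_tree, whose exact signature is not given here.

```rust
struct Node {
    is_leaf: bool,
    keys: Vec<(Vec<u8>, Vec<u8>)>,
    children: Vec<Box<Node>>,
}

/// Append the preorder encoding of `node` (and its subtree) to `out`.
fn serialize(node: &Node, out: &mut Vec<u8>) {
    out.push(if node.is_leaf { 1 } else { 0 });
    out.extend_from_slice(&(node.keys.len() as u32).to_le_bytes());
    for (key, value) in &node.keys {
        out.extend_from_slice(&(key.len() as u32).to_le_bytes());
        out.extend_from_slice(key);
        out.extend_from_slice(&(value.len() as u32).to_le_bytes());
        out.extend_from_slice(value);
    }
    if !node.is_leaf {
        // nkeys + 1 children, each serialized recursively in order.
        for child in &node.children {
            serialize(child, out);
        }
    }
}
```

Serializing an empty leaf yields the five bytes listed above.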

This format captures structure, not just contents. Two trees with the same {(key, value)} set but different splits / shapes produce different byte sequences — so scripts/cross_test.sh would catch a language whose insertion order or split rule diverged, even if the externally-visible scan output still agreed.

Deterministic workload

run_workload(scenario, seed, ops) drives a fresh tree using SplitMix64(seed) to generate keys (8-byte big-endian indices modulo a 200-entry key space) and values (4-byte big-endian). The three scenarios:

scenario	per-iteration behavior
`inserts`	always `insert(key, val)`
`deletes`	insert during the first half, `delete(key)` during the second
`mixed`	bits 62..63 of `r1` decide: insert (0,1), delete (2), no-op (3)

Two PRNG outputs are consumed per iteration regardless of which branch is taken, so the key sequence is invariant under the scenario choice and only the operation kind differs. This makes the three scenarios easy to reason about: they all visit the same keys in the same order.
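The fixed two-draws-per-iteration rule can be sketched as follows. SplitMix64 uses its standard published constants; everything else is an assumption for illustration, in particular that the key comes from `r1 % 200` and the value from the low 32 bits of `r2` (the text above does not pin down that mapping), and the names `op_kind` and `step`.

```rust
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        SplitMix64 { state: seed }
    }

    /// Standard SplitMix64 step: wrapping add, then two xor-shift-multiply rounds.
    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

#[derive(Debug, PartialEq, Clone, Copy)]
enum OpKind {
    Insert,
    Delete,
    Noop,
}

/// Decide the operation for iteration `iter` of `ops`, given the first draw.
fn op_kind(scenario: &str, iter: usize, ops: usize, r1: u64) -> OpKind {
    match scenario {
        "inserts" => OpKind::Insert,
        "deletes" => {
            if iter < ops / 2 { OpKind::Insert } else { OpKind::Delete }
        }
        "mixed" => match r1 >> 62 {
            0 | 1 => OpKind::Insert,
            2 => OpKind::Delete,
            _ => OpKind::Noop,
        },
        other => panic!("unknown scenario: {}", other),
    }
}

/// One workload iteration: always consume exactly two draws.
fn step(rng: &mut SplitMix64, scenario: &str, iter: usize, ops: usize)
    -> (OpKind, [u8; 8], [u8; 4])
{
    let r1 = rng.next_u64();
    let r2 = rng.next_u64();
    let key = (r1 % 200).to_be_bytes();  // ASSUMPTION: key index taken from r1
    let value = (r2 as u32).to_be_bytes(); // ASSUMPTION: value taken from r2
    (op_kind(scenario, iter, ops, r1), key, value)
}
```

Because `step` draws twice before it looks at the scenario, the key sequence for a given seed is the same in all three scenarios.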

The `btreectl` CLI

btreectl --seed N --ops M --scenario {inserts|deletes|mixed}

Runs the chosen workload and writes the serialized tree to stdout (raw bytes, no trailing newline).

Cross-language invariant

scripts/cross_test.sh runs the same (seed, ops, scenario) triple through Rust, Go, and C++ btreectl binaries and asserts that all three produce byte-identical output via sha256 for two scenarios:

scenario	seed	ops	sha256	size
A `inserts`	42	500	`4b587ccce2627561c03d5db0c2c172642c9f3ed188c97fc53a215a3d0f316088`	varies
B `mixed`	7	500	`9edbeec6436ee549c8a52b97f286831ed340c4bb588c6371542cdf0421e37718`	2515 B

A matching hash proves that all three implementations agree on: the PRNG, the lexicographic key compare, the proactive-split insertion, the proactive-rebalance deletion, and the precise tree shape after the workload. Any drift in any of these surfaces as a sha256 mismatch.

What's intentionally out of scope

Persistence. db-11 introduces the pager and turns nodes into fixed-size disk pages.
B+-tree leaves-only-values layout. Also db-11; it's the natural change once internal nodes need to fit one to a page.
Concurrent / lock-coupling B-trees. db-13 (MVCC) and db-21 (storage-engine advanced) explore copy-on-write and latch protocols.
Variable-length keys with prefix compression. SQLite and RocksDB both do this; we will revisit in db-15.

db-10 — References

Primary source

Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Introduction to Algorithms, 3rd or 4th ed., MIT Press. Chapter 18 (B-Trees) is the textbook treatment whose proactive-split / proactive-rebalance discipline we follow line-for-line. This is the single most important reference for the lab.

Original papers

R. Bayer & E. McCreight, Organization and Maintenance of Large Ordered Indices, Acta Informatica 1, 1972. The paper that introduced the B-tree.
D. Comer, The Ubiquitous B-Tree, ACM Computing Surveys 11(2), 1979. The classic survey; explains B+, B*, and variants. A useful read before starting db-11 where the leaves-only layout enters.

Production engines that use B-trees or B+-trees

SQLite — pager-backed B-trees (B+-tree layout for tables, plain B-tree layout for indexes); the closest analog to what db-10 + db-11 will become by db-15. https://www.sqlite.org/arch.html
Postgres — heap files plus B-tree indexes; index pages are very similar to SQLite's internal nodes. https://www.postgresql.org/docs/current/btree-implementation.html
MySQL / InnoDB — clustered B+-tree per table. https://dev.mysql.com/doc/refman/8.0/en/innodb-physical-structure.html
LMDB — B+-tree with copy-on-write pages; one of the cleanest open-source B-tree implementations to read. https://www.symas.com/lmdb
BoltDB / bbolt — pure-Go B+-tree store inspired by LMDB; readable and small. https://github.com/etcd-io/bbolt

Cross-lab dependencies

None upstream. db-10 is the start of the B-tree track and imports no earlier labs.
Downstream consumers: db-11 (pager) wraps each node in a fixed-size disk page; db-12 (SQL frontend) treats the tree as the table storage layer; db-13 (MVCC) snapshots node references rather than page bytes; db-14 (indexes) builds secondary B-trees over the primary tree's keys.

db-10 — Analysis

What had to be decided before any code was written, and why each decision shapes the next 5 labs.

Required invariants

Search-tree order. For every internal node with keys k_0 < k_1 < … < k_{n-1} and children c_0, c_1, …, c_n, every key in c_i is < k_i and every key in c_{i+1} is > k_i (for 0 ≤ i < n).
Bounded fanout. Non-root nodes hold between T - 1 and 2T - 1 keys (1..3 with T = 2). The root may hold fewer than T - 1 keys only when the tree is empty or the root is about to be collapsed.
Uniform depth. All leaves sit at the same depth from the root. This is what makes the worst-case lookup guaranteed to be O(log_T n), not merely expected.
Proactive split / rebalance. The descent on insert never needs to back up to fix an overflow; the descent on delete never needs to back up to fix an underflow. Each mutating operation touches each level on the path exactly once.
Canonical serialization. Two B-trees with the same shape and contents must serialize to the same bytes regardless of the operation history that produced them; two B-trees with different shapes must serialize to different bytes even if they hold the same key-value set.
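The first three invariants are mechanical enough to check in code. The following is a sketch of such a checker, not part of the lab as described; `check` and its error strings are hypothetical, and it assumes the `Node` layout used throughout this lab.

```rust
const T: usize = 2;
const MAX_KEYS: usize = 2 * T - 1;

struct Node {
    is_leaf: bool,
    keys: Vec<(Vec<u8>, Vec<u8>)>,
    children: Vec<Box<Node>>,
}

/// Returns the number of levels in the subtree rooted at `node`, or a
/// description of the first violated invariant. `lo` and `hi` are the
/// exclusive key bounds implied by the ancestors' separators.
fn check(
    node: &Node,
    is_root: bool,
    lo: Option<&[u8]>,
    hi: Option<&[u8]>,
) -> Result<usize, String> {
    let n = node.keys.len();
    // Bounded fanout.
    if n > MAX_KEYS || (!is_root && n < T - 1) || (!node.is_leaf && n == 0) {
        return Err(format!("bad key count {}", n));
    }
    // Search-tree order inside the node and against the ancestors.
    if node.keys.windows(2).any(|w| w[0].0 >= w[1].0) {
        return Err("keys not strictly increasing".to_string());
    }
    let below_lo = match (lo, node.keys.first()) {
        (Some(lo), Some((k, _))) => k.as_slice() <= lo,
        _ => false,
    };
    let above_hi = match (hi, node.keys.last()) {
        (Some(hi), Some((k, _))) => k.as_slice() >= hi,
        _ => false,
    };
    if below_lo || above_hi {
        return Err("key outside the range implied by ancestors".to_string());
    }
    if node.is_leaf {
        return if node.children.is_empty() {
            Ok(1)
        } else {
            Err("leaf with children".to_string())
        };
    }
    if node.children.len() != n + 1 {
        return Err("internal node needs nkeys + 1 children".to_string());
    }
    // Uniform depth: every child subtree must have the same number of levels.
    let mut levels: Option<usize> = None;
    for (i, child) in node.children.iter().enumerate() {
        let child_lo = if i == 0 { lo } else { Some(node.keys[i - 1].0.as_slice()) };
        let child_hi = if i == n { hi } else { Some(node.keys[i].0.as_slice()) };
        let d = check(child, false, child_lo, child_hi)?;
        if levels.map_or(false, |prev| prev != d) {
            return Err("leaves at different depths".to_string());
        }
        levels = Some(d);
    }
    Ok(levels.unwrap_or(0) + 1)
}
```

Calling `check(&root, true, None, None)` after every mutation in a unit test is a cheap way to catch a broken split or merge early.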

Design decisions

Why `T = 2` (smallest non-trivial degree)

Larger T means flatter trees and more keys per page — what real B-trees use to amortize disk I/O. But the algorithms are identical at every T ≥ 2, and T = 2 makes splits and merges frequent, which makes them easy to spot, easy to unit-test, and easy to render in a hex dump. db-11 will bump T to something realistic (e.g. matching a 4 KiB page) once nodes are pages.

Why B-tree, not B+-tree

A B+-tree puts values only in the leaves and threads the leaves into a doubly-linked list for range scans. That's the right call once nodes are disk pages: internal nodes shrink because they don't carry values, so fanout rises and depth falls. In-memory, with T = 2, the values-in-every-node B-tree is simpler and the savings would be invisible. db-11 swaps to a B+-tree when the disk-page trade-off applies.

Why the wire format encodes structure, not just contents

Two trees with the same {(k, v)} set can have different shapes if they were built by different insertion orders. A serializer that only emits the in-order key list (essentially scan()) would let a serious bug hide forever: say, splitting a full node at the wrong index, or merging where a borrow was required. Such a bug manifests only as a different tree shape, never as different scan results.

By emitting the full preorder shape, byte-equality across languages is byte-equality of the trees' physical state. db-11 will reuse this property: the page-byte serialization of a B+-tree should be exactly reproducible across implementations.
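A small demonstration of the point, as a sketch: two hand-built trees hold the same four keys in different shapes, so their in-order scans agree while their preorder serializations differ. `two_shapes`, `scan`, `serialize`, and `leaf` are illustrative helpers written for this example.

```rust
struct Node {
    is_leaf: bool,
    keys: Vec<(Vec<u8>, Vec<u8>)>,
    children: Vec<Box<Node>>,
}

/// Hypothetical helper: a leaf whose values equal its keys.
fn leaf(keys: &[&str]) -> Node {
    Node {
        is_leaf: true,
        keys: keys.iter().map(|k| (k.as_bytes().to_vec(), k.as_bytes().to_vec())).collect(),
        children: Vec::new(),
    }
}

/// Preorder encoding in the lab's canonical format.
fn serialize(node: &Node, out: &mut Vec<u8>) {
    out.push(if node.is_leaf { 1 } else { 0 });
    out.extend_from_slice(&(node.keys.len() as u32).to_le_bytes());
    for (k, v) in &node.keys {
        out.extend_from_slice(&(k.len() as u32).to_le_bytes());
        out.extend_from_slice(k);
        out.extend_from_slice(&(v.len() as u32).to_le_bytes());
        out.extend_from_slice(v);
    }
    for child in &node.children {
        serialize(child, out);
    }
}

/// In-order traversal: the externally visible contents only.
fn scan(node: &Node, out: &mut Vec<Vec<u8>>) {
    for (i, (key, _)) in node.keys.iter().enumerate() {
        if !node.is_leaf {
            scan(&node.children[i], out);
        }
        out.push(key.clone());
    }
    if let Some(last) = node.children.last() {
        scan(last, out);
    }
}

/// Two valid T = 2 trees over {a, b, c, d} with different separators.
fn two_shapes() -> (Node, Node) {
    let first = Node {
        is_leaf: false,
        keys: leaf(&["b"]).keys,
        children: vec![Box::new(leaf(&["a"])), Box::new(leaf(&["c", "d"]))],
    };
    let second = Node {
        is_leaf: false,
        keys: leaf(&["c"]).keys,
        children: vec![Box::new(leaf(&["a", "b"])), Box::new(leaf(&["d"]))],
    };
    (first, second)
}
```

A scan-only comparison would call these two trees equal; the structural serializer does not.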

Why the workload generator reads two PRNG outputs unconditionally

Each run_workload iteration consumes exactly r1 and r2, regardless of whether the chosen scenario inserts, deletes, or no-ops. If a scenario consumed a variable number of PRNG draws, the sequence of keys would diverge across scenarios for the same seed, making the cross-scenario hashes incomparable and the bug hunt much harder.

The cost: a small amount of wasted entropy on no-op iterations. The gain: scenarios inserts, deletes, and mixed all visit the same key-space in the same order for the same seed, so any divergence is the operations' fault, not the keys'.

Why scenarios live in the library, not in the CLI

run_workload(...) is a library function that returns a BTree. The btreectl binary is a one-liner around it. This means the inline unit tests can call run_workload("mixed", 42, 500) directly and assert determinism with no shell-out, no file I/O, and no path-dependent flakiness. The same property lets cross_test.sh trivially compare three independent CLI binaries.

Why three languages

Forces the API to be small and explicit. The Rust Box<Node> recursion translates to Go's struct pointer recursion and C++'s std::unique_ptr<Node> recursion; if the algorithm needs language-specific cleverness, you've over-fit to one runtime.
Pins integer arithmetic. SplitMix64 uses wrapping unsigned multiplication; expressing it identically in three languages is a forcing function for the cross-language hash to match.
Provides a deterministic conformance suite for the whole B-tree track. When db-11's pager produces a tree whose in-memory shape disagrees with the pure in-memory baseline, db-10's serializer is the comparison witness.

Tradeoffs worth flagging

The serializer recurses on the call stack. For pathologically deep trees this could overflow. With T = 2 and 64-bit keys drawn from a 200-key space, the worst-case height is roughly log_2 200 ≈ 8 and the stack is never the bottleneck. db-11's paged variant will be even shallower and is fine to keep recursive.
Keys and values are stored as owned Vec<u8> / []byte / std::vector<uint8_t>. This is the simplest correct choice and it dominates allocation cost. db-22 (perf) will revisit whether to intern, slice, or arena-allocate.
delete returns bool (was-present) rather than the removed value. Sufficient for testing; some real engines need the payload (e.g., to free its backing buffer). Easy to extend.
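The recursion-depth estimate in the first tradeoff can be made exact with the standard CLRS bound: a B-tree of minimum degree t and height h (counted in edges, root at depth 0) holds at least 2·t^h - 1 keys. A small sketch; `max_height` is a hypothetical helper, not part of the lab.

```rust
/// Largest height (edges from root to leaf) a B-tree of minimum degree `t`
/// can have while holding `n` keys, using min_keys(h) = 2 * t^h - 1.
fn max_height(n: u64, t: u64) -> u32 {
    assert!(n >= 1 && t >= 2);
    let (n, t) = (n as u128, t as u128); // widen to avoid overflow
    let mut height = 0u32;
    let mut power = t; // t^(height + 1)
    // Grow while a tree one level taller would still fit in n keys.
    while 2 * power - 1 <= n {
        height += 1;
        power *= t;
    }
    height
}
```

For the 200-key space with T = 2 this gives a maximum height of 6 edges, so the serializer recurses through at most 7 levels, comfortably within the rough estimate above.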

db-10 — Execution

What was built, in the order it was built.

1. Rust (`src/rust`)

Cargo.toml declares crate btree10 (lib) and a binary btreectl. Edition 2021, lto = "thin", codegen-units = 1 for release.
src/lib.rs contains:
- Constants T = 2, MAX_KEYS = 3, MIN_KEYS = 1.
- Node { is_leaf, keys: Vec<(Vec<u8>, Vec<u8>)>, children: Vec<Box<Node>> }.
- BTree with new, get, insert, delete, serialize_tree, scan, len, is_empty.
- Free functions split_child, insert_nonfull, delete_from, plus the rebalance helpers borrow_from_prev, borrow_from_next, merge_children.
- SplitMix64 PRNG (the textbook wrapping-add + xor-mul mix).
- run_workload(scenario, seed, ops) -> BTree.
- Inline #[cfg(test)] tests: empty-tree shape, single insert+get, insert + scan ordered, delete-of-absent returns false, delete-then-get returns None, deterministic shape under the three scenarios, scenario-cross seed independence.
src/bin/btreectl.rs: thin arg parser (--seed, --ops, --scenario), calls run_workload, writes serialize_tree() bytes to stdout.

2. Go (`src/go`)

go.mod module github.com/10xdev/dse/db10, Go 1.22.
btree.go ports the Rust API one-for-one. Pointer-based recursion: *node instead of Box<Node>. The serializer is byte-identical to Rust's: same preorder, same little-endian encodings.
btree_test.go mirrors all Rust tests.
cmd/btreectl/main.go is the matching CLI.

3. C++ (`src/cpp`)

CMakeLists.txt builds:
- btree10_lib (static library from src/btree.cc).
- btreectl (binary linking btree10_lib).
- test_btree10 (ctest target linking btree10_lib).
- Flags: -Wall -Wextra -Wpedantic -Werror -O3 -DNDEBUG in Release.
src/btree.h declares Node, BTree, run_workload, SplitMix64.
src/btree.cc implements them. std::unique_ptr<Node> plays the role of Rust's Box<Node>.
src/btreectl.cc is the CLI.
tests/test_btree10.cc mirrors Rust's inline tests. Uses #undef NDEBUG before <cassert> so asserts fire under Release; never assert(side_effect).

4. Scripts

scripts/verify.sh builds and runs unit tests in all three languages. Exits 0 only if all three are green; prints === OK ===.
scripts/cross_test.sh:
1. Builds Rust/Go/C++ btreectl binaries.
2. Scenario A: btreectl --seed 42 --ops 500 --scenario inserts in each language; sha256 + size comparison.
3. Scenario B: btreectl --seed 7 --ops 500 --scenario mixed; sha256 + size comparison.
4. Spot-check on the rust scenario-A output: assert a known key-prefix appears in the hex stream, guarding against silent-empty-output regressions.
5. Print === ALL OK ===.

What was deliberately not built

Persistence. No file I/O, no page format. db-11.
Range scans with iterator-style streaming. scan() returns the whole list; sufficient for tests, but a shortcut compared with a streaming iterator.
Bulk-loading from a sorted input. A real B-tree would offer a fast path that builds the tree bottom-up. db-15 may revisit.
Concurrency control. No latches, no locks. Trees of T = 2 fit comfortably in a single thread's working set and the lab has no concurrent test harness.

db-10 — Observation

What the cross-language verification actually proves, and what the serialized stream looks like by hand.

Output of `scripts/cross_test.sh`

=== compare Scenario A (inserts seed=42 ops=500) ===
  A          rust=4b587ccce2627561c03d5db0c2c172642c9f3ed188c97fc53a215a3d0f316088 (    ???? B)
  A          go  =4b587ccce2627561c03d5db0c2c172642c9f3ed188c97fc53a215a3d0f316088 (    ???? B)
  A          cpp =4b587ccce2627561c03d5db0c2c172642c9f3ed188c97fc53a215a3d0f316088 (    ???? B)
  match(A): 4b587ccce2627561c03d5db0c2c172642c9f3ed188c97fc53a215a3d0f316088
=== compare Scenario B (mixed seed=7 ops=500) ===
  B          rust=9edbeec6436ee549c8a52b97f286831ed340c4bb588c6371542cdf0421e37718 (    2515 B)
  B          go  =9edbeec6436ee549c8a52b97f286831ed340c4bb588c6371542cdf0421e37718 (    2515 B)
  B          cpp =9edbeec6436ee549c8a52b97f286831ed340c4bb588c6371542cdf0421e37718 (    2515 B)
  match(B): 9edbeec6436ee549c8a52b97f286831ed340c4bb588c6371542cdf0421e37718
=== spot-check stream contents ===
  spot-checks ok
=== ALL OK ===

Reading the stream by hand

The empty tree is exactly five bytes:

01           is_leaf = 1
00 00 00 00  nkeys   = 0

After one insert (key="a", val="1"):

01                    is_leaf = 1
01 00 00 00           nkeys   = 1
01 00 00 00           klen    = 1
61                    key     = "a"
01 00 00 00           vlen    = 1
31                    val     = "1"

After the fourth distinct key, the root must split:

00                    is_leaf = 0                   ← became internal
01 00 00 00           nkeys   = 1
04 00 00 00           klen    = 4                   ← promoted middle key
… key bytes …
04 00 00 00           vlen    = 4
… val bytes …
01 …                  left child (a leaf; is_leaf = 1, then its nkeys and entries)
01 …                  right child (a leaf; is_leaf = 1, then its nkeys and entries)

The is_leaf byte changes from 01 to 00 precisely at the moment the root grows upwards. There is no other operation that flips this byte for the root.
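Reading the stream by hand can also be automated. Below is a sketch of a decoder for the canonical format, useful when diffing two dumps; `decode`, `read_u32`, and `read_bytes` are hypothetical helpers, and the lab itself is only described as shipping a serializer.

```rust
struct Node {
    is_leaf: bool,
    keys: Vec<(Vec<u8>, Vec<u8>)>,
    children: Vec<Box<Node>>,
}

/// Read a little-endian u32 at `*pos`, advancing the cursor.
fn read_u32(buf: &[u8], pos: &mut usize) -> Option<u32> {
    let b = buf.get(*pos..pos.checked_add(4)?)?;
    *pos += 4;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Read `len` raw bytes at `*pos`, advancing the cursor.
fn read_bytes(buf: &[u8], pos: &mut usize, len: usize) -> Option<Vec<u8>> {
    let end = pos.checked_add(len)?;
    let bytes = buf.get(*pos..end)?.to_vec();
    *pos = end;
    Some(bytes)
}

/// Decode one node (and, for internal nodes, its nkeys + 1 children) in preorder.
/// Returns None on truncated or malformed input.
fn decode(buf: &[u8], pos: &mut usize) -> Option<Node> {
    let is_leaf = match *buf.get(*pos)? {
        1 => true,
        0 => false,
        _ => return None,
    };
    *pos += 1;
    let nkeys = read_u32(buf, pos)? as usize;
    let mut keys = Vec::new();
    for _ in 0..nkeys {
        let klen = read_u32(buf, pos)? as usize;
        let key = read_bytes(buf, pos, klen)?;
        let vlen = read_u32(buf, pos)? as usize;
        let value = read_bytes(buf, pos, vlen)?;
        keys.push((key, value));
    }
    let mut children = Vec::new();
    if !is_leaf {
        for _ in 0..=nkeys {
            children.push(Box::new(decode(buf, pos)?));
        }
    }
    Some(Node { is_leaf, keys, children })
}
```

A well-formed stream decodes to a single root with the cursor ending exactly at the buffer length; leftover bytes indicate a format mismatch.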

What the matching sha256 proves

A single matching match(...) line proves that all three implementations agree on, at the byte level:

The PRNG. Any drift in SplitMix64 would change the key stream, and the serialized tree would diverge as early as the first key stored in the root.
The lexicographic byte compare. Different ordering would re-route the descent at every internal node from key 4 onward.
The proactive-split rule. Different split rules would produce different children counts and nkeys fields at every level above the leaves.
The proactive-rebalance rule (Scenario B). The mixed scenario hits both insert and delete paths; the matching hash proves the borrow/merge logic agrees across all three.
The preorder serializer with little-endian length prefixes. Different endianness would flip every multi-byte length field in the stream; a different node order would rearrange everything after the root.

Any one of these going wrong, in any one of the three languages, makes the hashes diverge.

Sizes

Scenario B settles at exactly 2515 B for seed=7, ops=500, scenario=mixed. The Scenario A size is not pinned in this document, but it is identical across all three languages (see the script output).

Spot-check rationale

The script greps the Rust scenario-A output for a known key prefix that must be inserted by SplitMix64(42)'s first few outputs. This guards against the silent-success regression where every language is "successfully" producing the same five-byte empty-tree header and nothing else.

db-10 — Verification

Prerequisites

macOS or Linux with Apple Clang / clang ≥ 14 / gcc ≥ 11.
cmake ≥ 3.20.
Rust toolchain ≥ 1.74.
Go ≥ 1.22.
shasum, xxd, awk (default on macOS; coreutils on Linux).

One command

cd db-10-btree-fundamentals
scripts/verify.sh        # unit tests, all three languages
scripts/cross_test.sh    # cross-language sha256 match

Both should print === OK === / === ALL OK === and exit 0.

Per-language drill-down

Rust

cd db-10-btree-fundamentals/src/rust
cargo test --quiet
cargo build --release

Expected: all inline tests pass. The btreectl binary lands in target/release/btreectl.

Go

cd db-10-btree-fundamentals/src/go
go test ./...
go build ./cmd/btreectl

Expected: ok github.com/10xdev/dse/db10 <duration>.

C++

cd db-10-btree-fundamentals/src/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
ctest --test-dir build --output-on-failure

Expected: 100% tests passed, 0 tests failed out of 1 and the test_btree10 target prints OK.

What "green" means

A green run guarantees:

All inline unit tests pass in Rust, Go, and C++.
The cross-language test produces byte-identical serialized trees for both canonical scenarios:

scenario	seed	ops	sha256
A inserts	42	500	4b587ccce2627561c03d5db0c2c172642c9f3ed188c97fc53a215a3d0f316088
B mixed	7	500	9edbeec6436ee549c8a52b97f286831ed340c4bb588c6371542cdf0421e37718

Matching sha256 across three independent implementations proves agreement on the PRNG, the lexicographic compare, the proactive-split insert, the proactive-rebalance delete, and the precise tree shape after the workload.
The spot-check confirms the stream is non-empty and contains an expected key prefix, guarding against the regression where all three languages "successfully" produce the same five-byte empty-tree header.

When verification fails

Cross-language sha256 mismatch on the very first byte — SplitMix64 divergence or wrong initial node is_leaf value.
Mismatch deep in the stream after matching headers — split or rebalance asymmetry; almost always a borrow-vs-merge decision that goes one way in two languages and the other in the third.
One language's scenario A matches but scenario B does not — a delete-path bug specific to that language. The inserts scenario never invokes delete, so it would not exercise the faulty path.
All three sha256s match each other but disagree with the baked-in expected hashes — a legitimate algorithm change. Make sure it was intentional, then update cross_test.sh and the table above in the same commit.
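When two dumps disagree, the failure modes above are distinguished by where the first differing byte sits. The helper below is a minimal sketch of that triage step; `first_divergence` is a hypothetical name, not part of the lab's scripts, and it assumes both serialized trees have already been read into memory.

```rust
/// Returns the offset of the first byte at which two serialized trees differ,
/// or None when they are byte-identical. If one dump is a strict prefix of the
/// other, the divergence offset is the length of the shorter dump.
fn first_divergence(a: &[u8], b: &[u8]) -> Option<usize> {
    match a.iter().zip(b.iter()).position(|(x, y)| x != y) {
        Some(i) => Some(i),
        None if a.len() != b.len() => Some(a.len().min(b.len())),
        None => None,
    }
}
```

An offset of 0 points at the root's is_leaf byte, which matches the first failure mode; a large offset points at a split or rebalance decision deeper in the preorder stream.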

db-10 — Broader Ideas

Where this in-memory B-tree fits in the rest of the track, and which real-world techniques live one or two steps beyond it.

Immediate next labs

db-11 — Pager system. Wraps each node in a fixed-size disk page. Trades the heap-allocated Box<Node> recursion for a PageId-indexed page cache plus a free-list. Introduces the B+-tree layout (values only in leaves; leaves doubly linked for range scans) because internal nodes must fit one to a page.
db-12 — SQL frontend. Parses a small SQL subset (CREATE TABLE, INSERT, SELECT, UPDATE, DELETE), plans it into a B+-tree-backed table, and exposes a REPL.
db-13 — Transactions and MVCC. Versioned B+-tree pages so readers do not block writers. Snapshots are root-page references at a given commit timestamp.
db-14 — Indexes and query optimization. Secondary B-trees whose keys are (indexed_column, primary_key) pairs. Plans index scans, index-only scans, and merge joins.
db-15 — SQLite-complete. Everything above stitched into one executable; the B-tree track's counterpart to db-09.

How this lab's pieces show up in real systems

T = 2 "demo size" B-trees are exactly what every textbook uses as a teaching aid, including the one most engineers learn on. Production engines use T chosen to fit a 4 KiB / 8 KiB page, but the algorithms are unchanged.
Proactive split / rebalance is the standard discipline; the alternative (descend, then walk back up to fix overflows) is textbook for binary search trees but rare in B-trees because it makes concurrency control much harder.
Preorder canonical serialization is the same shape SQLite uses for its page_dump tooling and what RocksDB's sst_dump produces for its SSTables. Every storage engineer needs some byte-exact dump format; here we picked the simplest one that captures structure.
SplitMix64 is the generator behind Java 8's SplittableRandom and the seeding routine recommended for the xoshiro / xoroshiro family. Its output function is a strong 64-bit mixer, so the keys the workload generator touches are realistically randomly distributed, not pathologically biased.

Performance experiments worth running later

Plot len() vs serialized size to see the per-key overhead at T = 2. Compare to T = 64 (db-11) to see how internal-node shrinkage from B+-tree leaves changes the breakdown.
Sweep KEY_SPACE from 100 up to 100 000 and watch the insert-delete-insert workload's steady-state size oscillate.
Replace the recursive serialize_tree with an explicit-stack iterative version and measure the wall-time gap. Useful prep before db-22.

What "production-quality" would require beyond this lab

Variable-length keys with prefix compression on the page.
Page-level checksums and a magic byte at offset 0 so a corrupted read fails loudly instead of returning random keys.
Free-list management for reclaimed pages after deletes (db-11).
Concurrent insert/delete protocols: latch coupling, optimistic lock coupling, or Lehman and Yao's "B-link tree" right-link technique for non-blocking traversal during a split.
Copy-on-write pages so readers see a consistent snapshot during writes (LMDB-style).
A persistent WAL record per page mutation so the tree can be replayed on recovery (db-03 / db-11).

None of these change the shape of the in-memory algorithms; they add policies on top of the same proactive-split / proactive-rebalance kernel built here.

db-10 step 01 — Tree shape and get / scan

Goal

Define the in-memory B-tree's node representation and implement the two read-only operations: point lookup get(k) and ordered scan() -> Vec<(k,v)>. No mutation yet; this step is about pinning the data structure.

Tasks

Declare constants T = 2, MIN_KEYS = T - 1, MAX_KEYS = 2T - 1.
Define a Node containing:
- is_leaf: bool
- keys: Vec<(Vec<u8>, Vec<u8>)> — sorted by key
- children: Vec<Box<Node>> — empty for leaves; for internal nodes, children.len() == keys.len() + 1
Define BTree { root: Box<Node> }. new() produces an empty leaf root.
Implement get(&self, key: &[u8]) -> Option<Vec<u8>> by descending from the root: at each node, find the first key >= target; if equal, return its value; if leaf, return None; else recurse into children[i].
Implement scan(&self) -> Vec<(Vec<u8>, Vec<u8>)> as the standard in-order traversal: for each i in 0..keys.len(), recurse into children[i], push keys[i]; finally recurse into the last child.
Implement len() and is_empty() as helpers.
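The task list above can be sketched in Rust as follows. This is one possible shape under the stated field layout, not the lab's reference solution; the `Entry` alias and the `scan_into` helper are names introduced here for illustration.

```rust
const T: usize = 2;
const MIN_KEYS: usize = T - 1;
const MAX_KEYS: usize = 2 * T - 1;

type Entry = (Vec<u8>, Vec<u8>); // (key, value); alias assumed for brevity

struct Node {
    is_leaf: bool,
    keys: Vec<Entry>,          // sorted by key
    children: Vec<Box<Node>>,  // empty for leaves, keys.len() + 1 otherwise
}

struct BTree {
    root: Box<Node>,
}

impl BTree {
    fn new() -> Self {
        BTree { root: Box::new(Node { is_leaf: true, keys: Vec::new(), children: Vec::new() }) }
    }

    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        let mut node = &self.root;
        loop {
            // First index whose key is >= the target (linear scan; fine for T = 2).
            let i = node
                .keys
                .iter()
                .position(|(k, _)| k.as_slice() >= key)
                .unwrap_or(node.keys.len());
            if i < node.keys.len() && node.keys[i].0.as_slice() == key {
                return Some(node.keys[i].1.clone());
            }
            if node.is_leaf {
                return None;
            }
            node = &node.children[i];
        }
    }

    fn scan(&self) -> Vec<Entry> {
        let mut out = Vec::new();
        scan_into(&self.root, &mut out);
        out
    }
}

// In-order traversal: child i, then key i, and finally the last child.
fn scan_into(node: &Node, out: &mut Vec<Entry>) {
    for i in 0..node.keys.len() {
        if !node.is_leaf {
            scan_into(&node.children[i], out);
        }
        out.push(node.keys[i].clone());
    }
    if !node.is_leaf {
        scan_into(&node.children[node.keys.len()], out);
    }
}
```

Because nothing mutates the tree yet, the acceptance tests build nodes by hand (a `Node` literal with two leaf children) and check that `get` and `scan` agree with the expected order.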

Acceptance

Inline unit tests:

get_on_empty_returns_none — BTree::new().get(b"k") == None.
manual_build_get_returns_value — manually construct a 3-key leaf, get returns the right value for each key and None for misses.
scan_is_sorted — manually construct an internal node with two leaf children; scan() returns the merged sorted sequence.

All three green in Rust, Go, and C++.

Discussion prompts

Why does get use linear scan over keys rather than binary search? For T = 2 the answer is obvious; for T = 256 is it still?
Why is is_leaf stored on each node rather than inferred from children.is_empty()?
What goes wrong if scan recurses into the last child before pushing the last key?

db-10 step 02 — Insert and delete (split, borrow, merge)

Goal

Implement mutation: insert(k, v) and delete(k) -> bool. Both operations must preserve the B-tree invariants: every leaf sits at the same depth, and every node except the root holds between MIN_KEYS and MAX_KEYS keys.

Tasks

Insert.
- If root.keys.len() == MAX_KEYS, grow up: wrap the old root in a new internal root with one child, then split_child(new_root, 0). This is the only place height ever increases.
- Then insert_nonfull(root, k, v).
insert_nonfull(node, k, v).
- If node.is_leaf: splice the entry into the sorted slot. If the key already exists, overwrite the value in place.
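- Else: find the smallest i such that key <= node.keys[i].0, or i = keys.len() if there is none (children[i] is the child whose range covers the key). If the key equals an entry of this internal node, overwrite its value in place instead of descending, since a plain B-tree stores values at every level. If children[i] is full, split it first (pre-emptive split), then if key > node.keys[i].0 advance i. Recurse into children[i].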
split_child(parent, i). Precondition: parent.children[i].keys.len() == MAX_KEYS == 3. Effect:
- Promote the middle key (index 1) into parent.keys[i].
- Move the right half (key 2 plus children 2..=3) into a new sibling at parent.children[i + 1].
- The left half (key 0 plus children 0..=1) remains in parent.children[i].
Delete. Recursive delete_from(node, k) -> bool that maintains the invariant that every node we descend into has ≥ T keys. Three cases, depending on where the key is found:
- Key in this leaf → splice out, return true.
- Key in this internal node → replace with in-order predecessor or successor (drawn from whichever neighbor child has ≥ T keys), then recursively delete that pred/succ.
- Key not in this subtree (descending into children[i]):
  - If children[i].keys.len() < T, borrow from children[i-1] or children[i+1] if one of them has > MIN_KEYS. Prefer left. Otherwise merge children[i] with its left or right sibling (prefer right if it exists, else left), pulling the separating key from the parent.
  - Recurse into children[i] (which is now safe).
After delete_from returns, if root became a keyless internal node, collapse: root = root.children.remove(0). This is the only place height ever decreases.
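The insert half of the tasks above might look like the sketch below. It is an illustration under assumptions, not the reference implementation: the `empty` and `keys_in_order` helpers are hypothetical names, and the choice to test for an equal key before splitting a full child is an assumption that may not reproduce the baked-in hashes. The delete path is omitted.

```rust
use std::cmp::Ordering;

const T: usize = 2;
const MAX_KEYS: usize = 2 * T - 1;

type Entry = (Vec<u8>, Vec<u8>);

struct Node {
    is_leaf: bool,
    keys: Vec<Entry>,
    children: Vec<Box<Node>>,
}

fn empty(is_leaf: bool) -> Box<Node> {
    Box::new(Node { is_leaf, keys: Vec::new(), children: Vec::new() })
}

/// Splits the full child parent.children[i] around its median key.
fn split_child(parent: &mut Node, i: usize) {
    let child = &mut parent.children[i];
    debug_assert_eq!(child.keys.len(), MAX_KEYS);
    let mut right = empty(child.is_leaf);
    right.keys = child.keys.split_off(T); // keys T.. move to the new sibling
    let median = child.keys.pop().unwrap(); // key T-1 is promoted
    if !child.is_leaf {
        right.children = child.children.split_off(T); // children T.. follow
    }
    parent.keys.insert(i, median);
    parent.children.insert(i + 1, right);
}

fn insert_nonfull(node: &mut Node, key: &[u8], val: &[u8]) {
    // Smallest index whose key is >= the target, or keys.len() if none.
    let mut i = node
        .keys
        .iter()
        .position(|(k, _)| k.as_slice() >= key)
        .unwrap_or(node.keys.len());
    if i < node.keys.len() && node.keys[i].0.as_slice() == key {
        node.keys[i].1 = val.to_vec(); // overwrite wherever the key lives
        return;
    }
    if node.is_leaf {
        node.keys.insert(i, (key.to_vec(), val.to_vec()));
        return;
    }
    if node.children[i].keys.len() == MAX_KEYS {
        split_child(node, i);
        // The promoted median now sits at keys[i].
        match key.cmp(node.keys[i].0.as_slice()) {
            Ordering::Equal => {
                node.keys[i].1 = val.to_vec();
                return;
            }
            Ordering::Greater => i += 1,
            Ordering::Less => {}
        }
    }
    insert_nonfull(&mut node.children[i], key, val);
}

fn insert(root: &mut Box<Node>, key: &[u8], val: &[u8]) {
    if root.keys.len() == MAX_KEYS {
        // Grow up: the only place where height increases.
        let old = std::mem::replace(root, empty(false));
        root.children.push(old);
        split_child(root, 0);
    }
    insert_nonfull(root, key, val);
}

/// In-order key listing, used only to check the result.
fn keys_in_order(node: &Node, out: &mut Vec<Vec<u8>>) {
    for (i, (k, _)) in node.keys.iter().enumerate() {
        if !node.is_leaf {
            keys_in_order(&node.children[i], out);
        }
        out.push(k.clone());
    }
    if !node.is_leaf {
        keys_in_order(&node.children[node.keys.len()], out);
    }
}
```

Splitting on the way down means insert_nonfull never has to walk back up: by the time it reaches a leaf, every ancestor is guaranteed to have room for a promoted median.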

Acceptance

Inline unit tests:

insert_then_get_roundtrip — insert 50 keys, all of them retrievable.
insert_overwrites — inserting ("k", "v1") then ("k", "v2") yields get("k") == "v2" and len() == 1.
delete_existing_returns_true — insert "k", delete "k" returns true, get("k") returns None.
delete_missing_returns_false — BTree::new().delete(b"k") is false.
inserts_grow_tree — insert enough keys to force at least one grow-up; check len() matches insertions.
deletes_shrink_tree — insert N keys then delete them all; len() goes to 0, tree is still well-formed (collapsed root).

All six green in Rust, Go, and C++.

Discussion prompts

Why is pre-emptive split preferred over "descend, recurse, split on the way back"?
For deletion, why must we ensure children[i].keys.len() >= T before descending, not after?
What's the tie-break rule when both siblings have spare keys — borrow from left or right? What's the cost of getting it wrong?
How would copy-on-write change split_child and delete_from?

db-10 step 03 — Serialize + CLI + cross-language byte-identity

Goal

Define a canonical wire format for the tree, build a btreectl CLI that runs a deterministic workload and writes the serialized tree to stdout, then prove via sha256 that all three implementations produce identical bytes for two distinct scenarios.

Wire format

Preorder traversal. Per node, in this exact order:

u8        is_leaf                  (1 = leaf, 0 = internal)
u32 LE    nkeys
nkeys *   { u32 LE klen, key bytes, u32 LE vlen, val bytes }
if !is_leaf:
    (nkeys + 1) * recurse(child_j)

All length prefixes are little-endian (matches every other lab in the workspace). The empty tree serializes as 01 00 00 00 00 (one empty leaf).
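A minimal Rust sketch of the preorder format above, assuming the Node layout from step 01. The `put_bytes` and `serialize_node` helpers are names introduced here; the lab only requires serialize_tree.

```rust
struct Node {
    is_leaf: bool,
    keys: Vec<(Vec<u8>, Vec<u8>)>,
    children: Vec<Box<Node>>,
}

// Writes a u32 little-endian length prefix followed by the raw bytes.
fn put_bytes(out: &mut Vec<u8>, bytes: &[u8]) {
    out.extend_from_slice(&(bytes.len() as u32).to_le_bytes());
    out.extend_from_slice(bytes);
}

// Preorder: this node's header and entries first, then each child in order.
fn serialize_node(node: &Node, out: &mut Vec<u8>) {
    out.push(if node.is_leaf { 1 } else { 0 });
    out.extend_from_slice(&(node.keys.len() as u32).to_le_bytes());
    for (k, v) in &node.keys {
        put_bytes(out, k);
        put_bytes(out, v);
    }
    if !node.is_leaf {
        for child in &node.children {
            serialize_node(child, out);
        }
    }
}

fn serialize_tree(root: &Node) -> Vec<u8> {
    let mut out = Vec::new();
    serialize_node(root, &mut out);
    out
}
```

Serializing an empty leaf yields the five bytes 01 00 00 00 00 quoted above, which makes a convenient first unit test in each language.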

Deterministic workload

KEY_SPACE = 200

run_workload(scenario, seed, ops):
    rng = SplitMix64(seed)
    tree = BTree::new()
    for i in 0..ops:
        r1 = rng.next_u64()
        r2 = rng.next_u64()                 # ALWAYS draw both
        key = (r1 % KEY_SPACE).to_be_bytes()  # u64 BE = 8 bytes
        val = (r2 as u32).to_be_bytes()       # u32 BE = 4 bytes
        match scenario:
            "inserts" : tree.insert(&key, &val)
            "deletes" : if i < ops/2: tree.insert(&key, &val) else: tree.delete(&key)
            "mixed"   : op = (r1 >> 62) & 0x3
                        0 | 1 -> insert ; 2 -> delete ; 3 -> skip
    return tree

Two PRNG draws per iteration is non-negotiable; if any implementation short-circuits the second draw on a skip branch, the seed → state mapping desyncs.

CLI contract

btreectl --seed N --ops M --scenario {inserts | deletes | mixed}

Writes the canonical wire-format bytes (no trailing newline) to stdout.

Tasks

Add serialize_tree(&self) -> Vec<u8> to BTree. Pure function; does not mutate the tree.
Implement the SplitMix64 PRNG with the standard constants (0x9E3779B97F4A7C15, 0xBF58476D1CE4E7B5, 0x94D049BB133111EB).
Implement run_workload per the spec above.
Implement btreectl in Rust, Go, and C++.
Write scripts/verify.sh that runs unit tests in all three langs.
Write scripts/cross_test.sh that:
1. Builds all three binaries.
2. Scenario A: btreectl --seed 42 --ops 500 --scenario inserts → sha256 all three → assert match. Hash: 4b587ccce2627561c03d5db0c2c172642c9f3ed188c97fc53a215a3d0f316088.
3. Scenario B: btreectl --seed 7 --ops 500 --scenario mixed → sha256 all three → assert match. Hash: 9edbeec6436ee549c8a52b97f286831ed340c4bb588c6371542cdf0421e37718.
4. Spot-check that the stream contains an expected byte sequence (defensive against silent-empty regressions).
5. Print === ALL OK ===.

Acceptance

$ scripts/verify.sh
=== rust === ... ok
=== go   === ... ok
=== cpp  === ... ok
=== OK ===

$ scripts/cross_test.sh
...
  match(A): 4b587ccce2627561c03d5db0c2c172642c9f3ed188c97fc53a215a3d0f316088
  match(B): 9edbeec6436ee549c8a52b97f286831ed340c4bb588c6371542cdf0421e37718
=== ALL OK ===

A byte-identical hash across three independent implementations for both scenarios is a near-proof that the PRNG, key/value encoding, insert path, delete path, and serialization format are all spec-compliant.

Discussion prompts

Why must we draw two PRNG outputs per iteration even when the scenario chooses to skip?
Why is the wire format preorder rather than level-order or in-order? What property does preorder preserve that the others lose?
If the Scenario-A hash matches but Scenario-B doesn't, what code paths are the prime suspects, and why?
The sha256s are baked into cross_test.sh as constants. What is the benefit, and what is the maintenance cost when the wire format legitimately evolves?

The first lab of the B-tree track where bytes leave RAM. db-10 built a B-tree out of Box<Node>s and proved three languages agreed on shape; this lab builds the substrate that turns those shapes into durable files. Every disk-backed engine in the series from here on — SQLite (db-15), MVCC (db-13), the distributed KV store (db-20), and the capstone (db-23) — sits on top of a pager. This is the component most production databases share.

What is it?

A pager is the layer that:

Carves a file into fixed-size pages (we use 4 KiB by default; tests run with 256 B to keep dumps readable).
Hands out pages by 1-based page id; page 0 is reserved for a file header that nails down the format.
Maintains an in-memory page cache of bounded capacity, evicts with LRU, and writes dirty pages back to disk on eviction and on explicit flush().
Calls fsync exactly when the user asks for durability, never on every write.

The interface is intentionally minimal:

open(path, page_size, cache_capacity) -> Pager
Pager::allocate() -> PageId            // grow file by one page
Pager::read(pid)  -> Vec<u8>           // page_size bytes
Pager::write(pid, bytes)               // bytes.len() == page_size
Pager::flush()                         // write all dirty + fsync
Pager::close()

No B-tree nodes, no records, no keys. The B+-tree in db-15 will encode those structures into the page bytes; the pager neither knows nor cares what the bytes mean.

Why does it matter?

The cache is the database. Every production engine spends most of its time hitting a buffer pool, not reading disk. The LRU policy, the dirty bit, and the eviction discipline are the difference between "fits in RAM, fast" and "thrashes, dead".
The file layout is a binding contract. Once two implementations agree on byte 0 of every page, the database is portable across languages and platforms. db-15 will reuse this contract; the cross-language hash test in this lab proves it holds before the B+-tree code ever runs.
fsync is the only thing that buys durability. Every other write is just a hint to the OS. Knowing exactly when fsync runs (and why) is what separates working systems from data-loss outages.

How does it work?

File layout

offset 0                            offset = N * page_size
┌─────────────────────────┬─────────────┬─────────────┬─────┐
│  page 0 (header)        │   page 1    │   page 2    │ ... │
│  magic | psz | npages   │ user bytes  │ user bytes  │     │
│  + zero-pad to page_size│             │             │     │
└─────────────────────────┴─────────────┴─────────────┴─────┘

Page 0 is 24 bytes of header + zero-padding:

offset	size	field	value
0	16	magic	`"DSE-PAGER-v1\0\0\0\0"` (ASCII + NULs)
16	4	page_size	u32 little-endian
20	4	num_pages	u32 little-endian (includes page 0)
24	rest	zeros	padding to `page_size`

num_pages is the durable page count — what the file claims after fsync. The in-memory pager may have allocated pages beyond that which have not been flushed yet; close()/flush() reconcile them.
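The header table translates almost directly into code. The sketch below is an assumption about how one might encode and validate page 0; `encode_header` and `decode_header` are hypothetical names, while MAGIC and HEADER_LEN follow the constants named in the lab.

```rust
const MAGIC: &[u8; 16] = b"DSE-PAGER-v1\0\0\0\0";
const HEADER_LEN: usize = 24;

/// Builds a full page 0: 24 header bytes followed by zero padding.
fn encode_header(page_size: u32, num_pages: u32) -> Vec<u8> {
    assert!(page_size as usize >= HEADER_LEN, "page too small for header");
    let mut page = vec![0u8; page_size as usize];
    page[0..16].copy_from_slice(MAGIC);
    page[16..20].copy_from_slice(&page_size.to_le_bytes());
    page[20..24].copy_from_slice(&num_pages.to_le_bytes());
    page
}

/// Parses page 0 and returns (page_size, num_pages), rejecting a bad magic.
fn decode_header(page: &[u8]) -> Result<(u32, u32), String> {
    if page.len() < HEADER_LEN {
        return Err("short header".to_string());
    }
    if page[..16] != MAGIC[..] {
        return Err("bad magic".to_string());
    }
    let page_size = u32::from_le_bytes([page[16], page[17], page[18], page[19]]);
    let num_pages = u32::from_le_bytes([page[20], page[21], page[22], page[23]]);
    Ok((page_size, num_pages))
}
```

For page_size = 256 the first 20 bytes come out as 4453452d50414745522d76310000000000010000 in hex: the magic followed by 00 01 00 00.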

Cache, in pictures

        cache_capacity = 3,   recent = [pid=5] [pid=2] [pid=7]
                                MRU             LRU

  read(5)   →   hit, promote 5 to head        [5] [2] [7]
  write(9)  →   miss, evict 7 (writeback)     [9*] [5] [2]      ← 9 dirty
  read(2)   →   hit, promote 2                [2] [9*] [5]
  flush()   →   write 9, fsync                [2] [9 ] [5]

Each frame in the cache carries:

pid: u32
data: Vec<u8> of length page_size
dirty: bool
linked-list pointers (prev / next) into the LRU chain

The lookup table is a hashmap pid → frame_index (Rust) or pid → *list.Element (Go) or pid → list iterator (C++). All three give O(1) lookup; promotion to MRU is O(1) doubly-linked-list splice.
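The Rust flavor of this structure (a hashmap from pid to frame index plus prev/next indices) can be sketched as follows. Page data and disk I/O are left out to keep the list mechanics visible; the `Lru` type and its `touch`, `order`, and `is_dirty` methods are hypothetical names, not the lab's API.

```rust
use std::collections::HashMap;

struct Frame {
    pid: u32,
    dirty: bool,
    prev: Option<usize>,
    next: Option<usize>,
}

struct Lru {
    capacity: usize,
    frames: Vec<Frame>,
    map: HashMap<u32, usize>, // pid -> index into frames
    head: Option<usize>,      // most recently used
    tail: Option<usize>,      // least recently used
    hits: u64,
    misses: u64,
}

impl Lru {
    fn new(capacity: usize) -> Self {
        Lru { capacity, frames: Vec::new(), map: HashMap::new(), head: None, tail: None, hits: 0, misses: 0 }
    }

    // Detaches frame idx from the chain in O(1).
    fn unlink(&mut self, idx: usize) {
        let (prev, next) = (self.frames[idx].prev, self.frames[idx].next);
        match prev {
            Some(p) => self.frames[p].next = next,
            None => self.head = next,
        }
        match next {
            Some(n) => self.frames[n].prev = prev,
            None => self.tail = prev,
        }
    }

    // Makes frame idx the new MRU head in O(1).
    fn push_front(&mut self, idx: usize) {
        self.frames[idx].prev = None;
        self.frames[idx].next = self.head;
        if let Some(h) = self.head {
            self.frames[h].prev = Some(idx);
        }
        self.head = Some(idx);
        if self.tail.is_none() {
            self.tail = Some(idx);
        }
    }

    /// Records an access to pid and returns the pid evicted to make room, if any.
    fn touch(&mut self, pid: u32, dirty: bool) -> Option<u32> {
        if let Some(&idx) = self.map.get(&pid) {
            self.hits += 1;
            self.frames[idx].dirty |= dirty;
            self.unlink(idx);
            self.push_front(idx);
            return None;
        }
        self.misses += 1;
        let mut evicted = None;
        let idx = if self.frames.len() == self.capacity {
            let victim = self.tail.expect("capacity must be > 0");
            self.unlink(victim);
            evicted = Some(self.frames[victim].pid); // a real pager writes back here if dirty
            self.map.remove(&self.frames[victim].pid);
            self.frames[victim] = Frame { pid, dirty, prev: None, next: None };
            victim
        } else {
            self.frames.push(Frame { pid, dirty, prev: None, next: None });
            self.frames.len() - 1
        };
        self.map.insert(pid, idx);
        self.push_front(idx);
        evicted
    }

    /// Resident pids from MRU to LRU.
    fn order(&self) -> Vec<u32> {
        let mut out = Vec::new();
        let mut cur = self.head;
        while let Some(i) = cur {
            out.push(self.frames[i].pid);
            cur = self.frames[i].next;
        }
        out
    }

    fn is_dirty(&self, pid: u32) -> bool {
        self.map.get(&pid).map_or(false, |&i| self.frames[i].dirty)
    }
}
```

Replaying the picture (warm the cache with 7, 2, 5, then read 5, write 9, read 2) leaves the chain as 2, 9, 5 from MRU to LRU, with 7 evicted by the write.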

Read path

read(pid):
    if pid == 0 or pid >= num_pages_in_memory: panic
    if pid in cache:
        promote cache[pid] to MRU
        cache_hits += 1
        return clone of cache[pid].data
    else:
        cache_misses += 1
        if cache is full:
            evict tail; if dirty, pwrite then mark clean
        buffer = pread(page_size bytes at offset pid * page_size)
        insert (pid, buffer, dirty=false) at MRU
        return clone

Write path

write(pid, bytes):
    assert bytes.len() == page_size
    if pid in cache:
        cache[pid].data = bytes
        cache[pid].dirty = true
        promote to MRU
    else:
        if cache is full: evict tail with write-back as above
        insert (pid, bytes, dirty=true) at MRU       ← no read!

The "write-without-read" path is the optimization that makes bulk loads cheap. A B+-tree splitting a leaf allocates a fresh page and writes the whole 4 KiB at once; reading the old (uninitialized) contents first would double I/O for nothing.

Allocate

allocate():
    pid = num_pages_in_memory
    num_pages_in_memory += 1
    return pid                       (1-based; pid 0 is the header)

The file is extended lazily — only when the page is actually written back (either via eviction or flush). This means a sequence of allocate(); allocate(); allocate() without writes never touches disk, which matters for transactions that roll back.

Flush

flush():
    rewrite page 0 with current num_pages
    for each cached page in ascending pid order:
        if dirty: pwrite at offset pid*page_size; mark clean
    fsync

Sorting by pid before write turns N scattered seeks into one sequential pass on a spinning disk. On SSDs the win is smaller but still real: writes to adjacent offsets can be merged by the kernel into fewer, larger I/O requests.
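The dirty-page loop of flush can be sketched over any seekable writer. This is a simplified illustration: `flush_dirty` is a hypothetical name, the header rewrite of page 0 is omitted, and the final fsync is only marked by a comment because an in-memory writer has nothing to sync.

```rust
use std::io::{Seek, SeekFrom, Write};

struct Frame {
    pid: u32,
    data: Vec<u8>,
    dirty: bool,
}

/// Writes every dirty frame in ascending pid order and marks it clean.
/// Returns the pids in the order they were written.
fn flush_dirty<W: Write + Seek>(
    file: &mut W,
    frames: &mut [Frame],
    page_size: usize,
) -> std::io::Result<Vec<u32>> {
    // Collect only the dirty frames, then sort so offsets ascend monotonically.
    let mut dirty: Vec<&mut Frame> = frames.iter_mut().filter(|f| f.dirty).collect();
    dirty.sort_by_key(|f| f.pid);
    let mut written = Vec::new();
    for f in dirty {
        file.seek(SeekFrom::Start(f.pid as u64 * page_size as u64))?;
        file.write_all(&f.data)?;
        f.dirty = false;
        written.push(f.pid);
    }
    file.flush()?; // with a std::fs::File, a real pager would call sync_all() here
    Ok(written)
}
```

Passing a std::io::Cursor over a Vec<u8> in tests keeps the check hermetic: the resulting byte vector can be compared directly against the expected file image.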

Determinism

The lab's verification depends on every operation being deterministic given the seed, the workload, and the cache capacity. Two things that look like they could leak nondeterminism but do not:

HashMap iteration order. No output ever depends on the cache map's iteration order; the flush loop sorts dirty frames by pid first.
fsync timing. fsync does not change the byte contents of the file, only their visibility after a crash. The sha256 we compare is taken from the post-flush file, which is fully determined.

Where this fits

Upstream: none directly; the pager is a from-scratch component.
Downstream: db-12 (SQL frontend storage), db-13 (MVCC over snapshot page versions), db-14 (secondary index B+-trees over the pager), db-15 (SQLite-complete), db-21 (advanced storage variants), and every distributed lab from db-16 onwards stores state on a pager-backed file.

db-11 — References

Primary sources

SQLite Pager design notes — the cleanest public description of a production pager, including how it interacts with rollback journals and WAL. The architecture of the db-11 pager is a deliberate simplification of this design. https://www.sqlite.org/atomiccommit.html https://www.sqlite.org/walformat.html
LMDB / mdb design — Howard Chu, MDB: A Memory-Mapped Database and Backend for OpenLDAP. Describes a B+-tree pager whose write path is copy-on-write rather than write-back. Useful counterpoint to the LRU + dirty-bit approach we took. https://www.symas.com/symas-embedded-database-lmdb
Goetz Graefe, Modern B-Tree Techniques, Foundations and Trends in Databases 3(4), 2010. Chapter 2 covers buffer-pool management and the page-eviction policies real engines use.

Operating-systems background

Andrew S. Tanenbaum, Modern Operating Systems, 4th ed., chapter on file systems and page caches. The OS's own page cache is conceptually our cache; understanding pread/pwrite/fsync at the kernel level explains why "writing" without fsync is not durable.
fsync(2) man page — the canonical answer to "what does fsync actually guarantee?" Read this before assuming a write reached disk.
Eduardo Pinheiro et al., Failure Trends in a Large Disk Drive Population, FAST 2007. Sobering reminder that the device under the pager does fail; durability is a probabilistic claim.

Replacement policies

Elizabeth O'Neil, Patrick O'Neil, Gerhard Weikum, The LRU-K Page Replacement Algorithm For Database Disk Buffering, SIGMOD 1993. Why naive LRU thrashes on scan-heavy workloads, and the fix everyone borrowed.
Theodore Johnson, Dennis Shasha, 2Q: A Low Overhead High Performance Buffer Management Replacement Algorithm, VLDB 1994. PostgreSQL shipped a 2Q variant briefly in the 8.0.x series before moving to clock sweep.
The db-11 implementation deliberately uses textbook LRU. db-22 (performance) will measure when this hurts and what 2Q / CLOCK / ARC buy.

SQLite (src/pager.c, src/pcache.c) — heavy reading, but the comments are excellent. https://www.sqlite.org/src/file?name=src/pager.c
BoltDB / bbolt (db.go, freelist.go) — small enough to read in an afternoon. https://github.com/etcd-io/bbolt
InnoDB (storage/innobase/buf/) — large, but the buf_pool_t and buf_LRU.cc files are where the buffer-pool policy lives.

Cross-lab dependencies

Upstream: none. The pager is a from-scratch component.
Downstream: db-12 (SQL frontend), db-13 (MVCC), db-14 (indexes), db-15 (SQLite-complete), db-20 (distributed KV) all store state on top of a pager file in this format.

db-11 — Analysis

What had to be decided before any code was written, and why each choice locks in trade-offs the rest of the B-tree track will pay for or be paid by.

Required invariants

File layout is canonical. Bytes 0..15 of page 0 are the magic string DSE-PAGER-v1\0\0\0\0; bytes 16..19 are page_size LE; bytes 20..23 are num_pages LE; bytes 24..page_size-1 are zero. Any pager implementation that produces or consumes a file must agree on these bytes to the bit.
Cache capacity is hard. After every operation, the number of resident frames is <= cache_capacity. The eviction path maintains this invariant before admitting a new frame, never after.
Dirty pages survive eviction. If a page is evicted while dirty == true, its bytes are written to disk before the frame is reused. The cache may evict at any time; a dropped dirty page is a data-loss bug.
Determinism. Given (path, page_size, cache_capacity, seed, ops, scenario), the post-flush file bytes are a pure function of those inputs. Two languages running the same workload must produce sha256-identical files.
Page 0 is reserved. User code receives only pid >= 1 from allocate(). read(0) / write(0) is undefined behaviour (panic in Rust; documented but unenforced in Go/C++).

Design decisions

Why a 16-byte magic instead of 8

8 bytes (e.g. DSEPAGER) would have fit one register and saved 8 bytes per file. 16 bytes lets us include a version suffix and a human-readable prefix that shows up in strings(1). The cost is trivial; the debugging payoff (strings db.bin | grep DSE) is immediate.

Why fixed page size at open() rather than per-page

A real engine fixes page size when the database file is created and refuses to mount it under a different page size. We bake this in by writing the page size into the header and re-reading it on open. The cost: changing page size means rewriting the file. The gain: no per-page metadata, no alignment surprises, page offsets are just pid * page_size.

Why 1-based page ids

Page 0 is the header. Letting allocate() return 0 would force every caller to remember the "0 is reserved" rule and to check it on every dereference. By starting allocation at 1, the contract is enforced by construction: any pid you legitimately hold is safe to read.

Why LRU (and not CLOCK, 2Q, ARC, LFU, …)

LRU is the textbook policy and the easiest one to verify deterministic across three languages. Its weakness — sequential scans flush a hot working set — is real but invisible at the cache sizes our tests use (capacity 8 over 100 pages). db-22 will revisit and measure; until then, simplicity dominates.

Why a doubly-linked list, not a `BTreeMap<LastUsed, PageId>`

A balanced map gives O(log n) operations and self-orders by recency. A doubly-linked list plus a hashmap gives O(1) operations and the same eviction order, at the cost of one extra pointer per frame. For a cache of 1000 frames that is roughly log2(1000) ≈ 10 comparisons per touch versus a constant handful of pointer updates. Worth the boilerplate; SQLite's page cache, RocksDB's LRUCache, and InnoDB's buffer pool all use variations of the list-plus-map shape.

Why write-back, not write-through

Write-through (every write() synchronously persists) is simpler but makes random updates ~100x slower because every dirty page costs a seek and an fsync. Write-back lets us batch many writes to the same page (db-10's B-tree insert may rewrite the same node several times during a single workload) and amortize one disk write per page per flush. The tax is the dirty-page accounting, which is enforced by invariant 3 above.

Why fsync only on `flush()`

The pager's user owns the durability story. SQLite calls flush at every COMMIT; an LSM (db-05) calls it after every WAL append; an embedded counter store might call it once a minute. Pushing the decision up keeps the pager honest: it never claims durability it cannot deliver. The cross-test scenarios all call flush() exactly once at the end, which is why their hashes are stable.

Why write-without-read on a cache miss

If write(pid, bytes) evicts a clean page and admits (pid, bytes, dirty=true) without first reading the old contents, the disk's bytes are overwritten entirely on the next eviction or flush. This is safe because write requires bytes.len() == page_size — the whole page is supplied. Reading the old contents first would be a 4 KiB I/O for data we throw away immediately. A proper engine extends this with "page allocation hints" so that the OS can skip the readahead too; we don't bother.

Why the workload uses SplitMix64 (the same PRNG as db-10)

Three reasons:

Identical implementation across languages. Three lines of wrapping-add and xor-mul; if any language gets it wrong the sha256 changes on the very first scenario.
No external dependency. Crypto-quality PRNGs would need different libraries per language; SplitMix64 is purely arithmetic.
Consistency across the track. Reusing the same PRNG as db-10 means a future cross-lab test can compare hashes from "B-tree built in RAM" against "B-tree built on the pager" using the same key sequence.

The PRNG draws exactly one u64 per iteration and uses specific bit-slices for op/byte/pid. A variable number of draws per iteration would make scenarios diverge in their key streams, which would defeat reason 3.

Why the scenarios are sequential / random / mixed

sequential stresses the readahead-friendly path: page ids walk in monotonic order, cache hits dominate, evictions are predictable.
random stresses the eviction path: for uniformly random page ids the cache hit ratio is roughly cache_capacity / num_pages, evictions happen on most writes, dirty pages move through the cache constantly.
mixed is what real workloads look like: a hot subset (selected by (r>>60)&1) plus a long tail of cold pages.

These three together exercise the entire cache state machine. If any of them diverges across languages, the bug is localized (sequential bugs are accounting; random bugs are eviction; mixed bugs are interaction).

Tradeoffs worth flagging

No free-list, ever. allocate() only grows the file. Once a B+-tree splits a page and later coalesces it, the now-unused page id is leaked. db-21 (storage engine advanced) will reclaim via a free-list page; here it would just be unverifiable code.
Vec<u8> per frame. Every cached page is its own allocation. A real engine packs frames into a single arena (the buffer pool) and indexes by offset. db-22 will measure the difference and likely arena-allocate.
No checksums. A corrupted page returns its corrupted bytes silently. db-15 will add a CRC32 to the page footer when SQLite semantics demand it.
No mmap. mmap-backed pagers (LMDB) are dramatically simpler but inherit the OS's page-replacement decisions, which we want to control here for testability. db-21 may explore the mmap variant.
Single-threaded. No latching, no per-page reader/writer locks. db-13 (MVCC) and db-17 (Raft) will introduce concurrency on top of this layer.

db-11 — Execution

What was built, in the order it was built.

1. Rust (`src/rust`)

Cargo.toml declares lib crate pager11 and binary pagerctl. Edition 2021; release profile lto = "thin", codegen-units = 1.
src/lib.rs contains:
- Constants MAGIC: &[u8;16] = b"DSE-PAGER-v1\0\0\0\0", HEADER_LEN = 24.
- Frame { pid, data: Vec<u8>, dirty, prev: Option<usize>, next: Option<usize> } plus Pager { file, page_size, num_pages, capacity, frames: Vec<Frame>, free: Vec<usize>, map: HashMap<u32,usize>, head/tail: Option<usize>, hits, misses }.
- Pager::open(path, page_size, capacity), ::allocate(), ::read(pid), ::write(pid, bytes), ::flush(), ::cache_hits(), ::cache_misses(), ::num_pages().
- LRU helpers promote(frame_idx), evict_tail(), admit(...) operating on the indexed doubly-linked list.
- Hand-rolled SHA-256 (FIPS 180-4) so the lib has no dependencies. sha256_hex(bytes) and sha256_file(path).
- SplitMix64 PRNG and run_workload(path, page_size, capacity, pages, ops, seed, scenario) -> Pager.
- 10 inline #[cfg(test)] tests: header round-trip, allocate monotonic, read-after-write within and across eviction, dirty survives eviction, flush is idempotent, hits/misses counts, scenario determinism (sequential), scenario determinism (random), scenario determinism (mixed), SHA-256 empty-string test vector.
src/bin/pagerctl.rs: order-independent arg parser (args.windows(2) lookup). Subcommands init <path> [--page-size N] and workload <path> --seed S --ops N --pages P --cache C --scenario {sequential|random|mixed} [--page-size N]. Workload prints sha256_file(path) to stdout with no trailing newline.

2. Go (`src/go`)

go.mod module github.com/10xdev/dse/db11, Go 1.22.
pager.go ports the Rust API one-for-one. Uses container/list for the LRU chain and map[uint32]*list.Element for lookup. SHA-256 via crypto/sha256 (stdlib is fine; the cross-language comparison is on the file bytes, not the digest algorithm).
pager_test.go mirrors the 10 Rust tests plus an 11th, TestWorkloadMatchesCanonicalHashes, that bakes in the three canonical hashes (A/B/C) and runs all three scenarios in a loop. This is the test that catches "Go silently disagrees with Rust" regressions before the cross_test script even runs.
cmd/pagerctl/main.go is the matching CLI. Custom flag parser (findFlag, firstPositional, mustU64, mustInt) because flag.Parse stops at the first non-flag argument and the shared script passes <path> before the flags.

3. C++ (`src/cpp`)

CMakeLists.txt builds:
- pager11 (static lib from src/pager.cc + src/sha256.cc).
- pagerctl (executable linking pager11).
- test_pager11 (ctest target linking pager11).
- Flags: -Wall -Wextra -Wpedantic -Werror -O3 -DNDEBUG in Release.
src/pager.h, src/pager.cc: factory function Pager::open(...) returning a std::unique_ptr<Pager>. std::list<Frame> for the LRU chain; std::unordered_map<uint32_t, std::list<Frame>::iterator> for O(1) lookup. std::list::splice for promotion.
src/sha256.h, src/sha256.cc: FIPS 180-4 SHA-256 in ~120 lines.
src/pagerctl.cc: matching CLI. Includes <unistd.h> for getpid() (used by tests for unique tmp paths); the omission of that header was the only build error during initial bring-up.
tests/test_pager11.cc mirrors the Rust tests; uses #undef NDEBUG before <cassert> so asserts fire under Release. Prints OK 11 tests on success. Wired into ctest as a single test case.

4. Scripts

scripts/verify.sh:
1. Rust: cargo test --quiet.
2. Go: go test ./....
3. C++: cmake -S … -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j && ctest --test-dir build --output-on-failure.
4. Exits 0 only if all three are green; prints === OK ===.
scripts/cross_test.sh:
1. Builds Rust/Go/C++ pagerctl binaries (cargo release, go build, cmake+make).
2. Scenario A sequential, seed=42, pages=32, cache=8, ops=200, page_size=256: pagerctl workload <tmp> … per language; sha256 + size comparison against baked-in expected hash.
3. Scenario B random, seed=7, pages=64, cache=8, ops=500, page_size=256: same shape, different hash.
4. Scenario C mixed, seed=2024, pages=128, cache=16, ops=1000, page_size=512: same shape, different hash.
5. Spot-check: read the first 20 bytes of Scenario A's file and assert they equal 4453452d50414745522d76310000000000010000 (magic DSE-PAGER-v1\0\0\0\0 + 0x00000100 for page_size = 256 LE).
6. Print === ALL OK ===.

What was deliberately not built

Free-list / page reclamation. allocate() only grows the file. db-21 (storage engine advanced) introduces a free-list page.
Page checksums. No CRC32 footer. db-15 will add one when SQLite-compatibility demands it.
mmap backend. All I/O goes through pread/pwrite. An mmap-based variant is a possible db-21 follow-up.
Concurrency. No latches; the pager assumes a single thread. db-13 (MVCC) and db-17 (Raft) introduce concurrent access at higher layers.
WAL. db-11's pager has no WAL; durability is via in-place write + fsync at flush(). db-03 already covered WAL and db-13 will add a transactional WAL on top of the pager.
Compression / encryption. Out of scope; the page bytes are whatever the caller wrote.

db-11 — Observation

What the cross-language verification actually proves, and what the file looks like by hand.

Output of `scripts/cross_test.sh`

=== compare Scenario A (sequential seed=42 pages=32 cache=8 ops=200 ps=256) ===
  A          rust=cbac0289ce1eb784e5bd80ab1298c3f9677f1aeb3cfdb09ce78d6796c43b9428 (    8448 B)
  A          go  =cbac0289ce1eb784e5bd80ab1298c3f9677f1aeb3cfdb09ce78d6796c43b9428 (    8448 B)
  A          cpp =cbac0289ce1eb784e5bd80ab1298c3f9677f1aeb3cfdb09ce78d6796c43b9428 (    8448 B)
  match(A): cbac0289ce1eb784e5bd80ab1298c3f9677f1aeb3cfdb09ce78d6796c43b9428
=== compare Scenario B (random seed=7 pages=64 cache=8 ops=500 ps=256) ===
  B          rust=3405654fd750bffa933c2d1b590160fcbf8ec446f261cc25c5c04c8c0c3dd023 (   16640 B)
  B          go  =3405654fd750bffa933c2d1b590160fcbf8ec446f261cc25c5c04c8c0c3dd023 (   16640 B)
  B          cpp =3405654fd750bffa933c2d1b590160fcbf8ec446f261cc25c5c04c8c0c3dd023 (   16640 B)
  match(B): 3405654fd750bffa933c2d1b590160fcbf8ec446f261cc25c5c04c8c0c3dd023
=== compare Scenario C (mixed seed=2024 pages=128 cache=16 ops=1000 ps=512) ===
  C          rust=5b10acb3e9cf57e3b314c17dc9fa122d79caac6a46501c71875374f9d6720460 (   66048 B)
  C          go  =5b10acb3e9cf57e3b314c17dc9fa122d79caac6a46501c71875374f9d6720460 (   66048 B)
  C          cpp =5b10acb3e9cf57e3b314c17dc9fa122d79caac6a46501c71875374f9d6720460 (   66048 B)
  match(C): 5b10acb3e9cf57e3b314c17dc9fa122d79caac6a46501c71875374f9d6720460
=== spot-check header ===
  spot-checks ok
=== ALL OK ===

File sizes are exactly (pages + 1) * page_size:

A: (32 + 1) * 256 = 8448
B: (64 + 1) * 256 = 16640
C: (128 + 1) * 512 = 66048

The +1 is page 0 (header).

Reading the header by hand

For Scenario A (page_size = 256, num_pages = 33 including the header):

xxd -l 24 /tmp/pager-A.rust.bin

00000000: 4453 452d 5041 4745 522d 7631 0000 0000  DSE-PAGER-v1....
00000010: 0001 0000 2100 0000                      ....!...

Decoded:

bytes	meaning
`44 53 45 2d 50 41 47 45 52 2d 76 31 00 00 00 00`	magic `DSE-PAGER-v1\0\0\0\0`
`00 01 00 00`	page_size = `0x00000100` = 256 (LE)
`21 00 00 00`	num_pages = `0x00000021` = 33 (LE)

Bytes 24..255 are zero (header padding to page_size).
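
The same decoding can be done in code. The sketch below parses the 24-byte header layout from the table above; decode_header is a hypothetical name, not necessarily what the lab's library exposes.

```rust
const MAGIC: &[u8; 16] = b"DSE-PAGER-v1\0\0\0\0";

/// Decode the 24-byte header: 16 bytes of magic, then page_size and
/// num_pages as little-endian u32. Returns None on short input or bad magic.
fn decode_header(buf: &[u8]) -> Option<(u32, u32)> {
    if buf.len() < 24 || buf[..16] != MAGIC[..] {
        return None;
    }
    let page_size = u32::from_le_bytes([buf[16], buf[17], buf[18], buf[19]]);
    let num_pages = u32::from_le_bytes([buf[20], buf[21], buf[22], buf[23]]);
    Some((page_size, num_pages))
}

fn main() {
    // Rebuild Scenario A's header page in memory and decode it again.
    let mut page0 = vec![0u8; 256];
    page0[..16].copy_from_slice(MAGIC);
    page0[16..20].copy_from_slice(&256u32.to_le_bytes());
    page0[20..24].copy_from_slice(&33u32.to_le_bytes());
    println!("{:?}", decode_header(&page0));
}
```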

The cross-test's spot-check confirms bytes 0..19 exactly equal 4453452d50414745522d76310000000000010000. A change to the magic or the page_size encoding surfaces here first. If only one implementation changed, the sha256 match across the three languages also breaks; if all three changed together, the hashes still disagree with the canonical hashes table. A format bug therefore has to get past several overlapping checks, not just one.

Reading a data page by hand

For Scenario A the workload writes a known byte value B = (r >> 24) & 0xFF to every byte of the chosen page. So any non-zero data page in /tmp/pager-A.rust.bin should be 256 identical bytes:

xxd -s 256 -l 256 /tmp/pager-A.rust.bin | head -2

00000100: 8c8c 8c8c 8c8c 8c8c 8c8c 8c8c 8c8c 8c8c  ................
00000110: 8c8c 8c8c 8c8c 8c8c 8c8c 8c8c 8c8c 8c8c  ................

A run of one byte value repeated 256 times. Different pages contain different fill bytes; the sha256 of the file rolls all of them up. This makes hand-debugging a divergence between languages straightforward: dump both files, diff -u <(xxd a) <(xxd b), and the first non-matching page tells you exactly which (pid, byte) the languages disagreed on.
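
When the diff output is long, the same check can be automated. The helpers below are a hypothetical sketch (neither name exists in the lab): one finds the first page at which two file images differ, the other reports the fill byte of a uniform page. The sketch assumes page_size is non-zero.

```rust
/// Index of the first page whose bytes differ between two file images,
/// or None if the images are identical. Assumes page_size > 0.
fn first_divergent_page(a: &[u8], b: &[u8], page_size: usize) -> Option<usize> {
    let mismatch = a
        .chunks(page_size)
        .zip(b.chunks(page_size))
        .position(|(pa, pb)| pa != pb);
    match mismatch {
        Some(pid) => Some(pid),
        // Every shared page matched, but one image has extra pages.
        None if a.len() != b.len() => Some(a.len().min(b.len()) / page_size),
        None => None,
    }
}

/// Some(fill) if every byte of the page has the same value.
fn uniform_fill(page: &[u8]) -> Option<u8> {
    let first = *page.first()?;
    page.iter().all(|&b| b == first).then_some(first)
}

fn main() {
    let a = vec![0x8c_u8; 512];
    let mut b = a.clone();
    b[300] = 0x00; // corrupt one byte inside page 1
    println!("{:?}", first_divergent_page(&a, &b, 256));
    println!("{:?}", uniform_fill(&a[256..512]));
}
```

In practice the two slices would come from std::fs::read on the files written by two of the language ports.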

Cache statistics (informal)

Running Scenario B with cache = 8 over pages = 64:

hits   = ~190
misses = ~310

Hit ratio ~38%, about three times the cache_capacity / num_pages baseline (8 / 64 = 12.5%) that uniformly random access alone would predict; the figures are informal, so treat the gap as something to investigate rather than as a result. The Rust unit tests assert hits + misses == ops but not the exact ratio, because writes that bypass reads (write-on-miss admission) keep the absolute counts implementation-defined enough that an exact check would be fragile. The file bytes, however, are not implementation-defined, and that is what we pin.

db-11 — Verification

Prerequisites

macOS or Linux with Apple Clang / clang ≥ 14 / gcc ≥ 11.
cmake ≥ 3.20.
Rust toolchain ≥ 1.74.
Go ≥ 1.22.
shasum, xxd, awk (default on macOS; on Linux, shasum typically ships with Perl, xxd with vim-common or a standalone xxd package, and awk as gawk or mawk).

One command

cd db-11-pager-system
scripts/verify.sh        # unit tests, all three languages
scripts/cross_test.sh    # cross-language sha256 match

Both should print === OK === / === ALL OK === and exit 0.

Per-language drill-down

Rust

cd db-11-pager-system/src/rust
cargo test --quiet
cargo build --release

Expected: all 10 inline tests pass; target/release/pagerctl is built.

Go

cd db-11-pager-system/src/go
go test ./...
go build ./cmd/pagerctl

Expected: ok github.com/10xdev/dse/db11 <duration>. The TestWorkloadMatchesCanonicalHashes test is the most important; it fails loudly if Go disagrees with Rust on any of the three scenarios.

C++

cd db-11-pager-system/src/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
ctest --test-dir build --output-on-failure

Expected: 100% tests passed, 0 tests failed out of 1 and the test_pager11 target prints OK 11 tests.

What "green" means

A green run guarantees:

All inline unit tests pass in Rust, Go, and C++.

The cross-language test produces byte-identical files for all three canonical scenarios:

scenario	type	seed	pages	cache	ops	psz	sha256
A	sequential	42	32	8	200	256	`cbac0289ce1eb784e5bd80ab1298c3f9677f1aeb3cfdb09ce78d6796c43b9428`
B	random	7	64	8	500	256	`3405654fd750bffa933c2d1b590160fcbf8ec446f261cc25c5c04c8c0c3dd023`
C	mixed	2024	128	16	1000	512	`5b10acb3e9cf57e3b314c17dc9fa122d79caac6a46501c71875374f9d6720460`

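Matching sha256 across three independent implementations shows, for these three inputs, agreement on everything that determines the final bytes (the cache rules are pinned only to the extent that a bug in them loses or misplaces a write):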

the file format (header magic, page_size, num_pages encoding),
the SplitMix64 PRNG (constants and bit-extraction layout),
the workload state machine (op/pid/byte selection),
the cache admission rule (write-on-miss admits without read),
the eviction rule (LRU tail, dirty pages written back),
the flush order (dirty pages sorted by pid before write),
and the final on-disk page layout.

The spot-check confirms the first 20 bytes of Scenario A's file are 4453452d50414745522d76310000000000010000 (magic + page_size = 256 LE), guarding against the regression where all three languages "successfully" agree on a wrong header.

When verification fails

Cross-language sha256 mismatch on Scenario A only — the sequential scenario exercises the simplest code path (no random pid selection, predictable evictions). A failure here is almost always either:
- the magic / header encoding (check the spot-check first), or
- the SplitMix64 PRNG (re-derive the first 5 outputs for seed = 0 by hand and compare against 0xe220a8397b1dcdaf, …).
Scenario A matches, B fails — the random scenario stresses eviction. Look for off-by-one in LRU tail selection or for a language whose unordered_map iteration leaks into the flush order (it should not; flush sorts by pid).
A and B match, C fails — the mixed scenario uses a larger cache and a larger page size; suspect a page-size assumption baked into the implementation (e.g., a hard-coded 256 instead of reading from the header).
All three sha256s match each other but disagree with the table above — a legitimate algorithm change. Make sure it was intentional, then update cross_test.sh, the Go canonical-hashes test, the C++ canonical-hashes assertion, and the table above in the same commit.
One language's unit tests pass but cross_test fails — almost always a CLI bug, not a library bug. The unit tests drive the library directly; the cross_test drives the binary through the shell. Double-check argument parsing: in particular, that <path> may appear before the --flags (this is the bug the Go port hit during bring-up, fixed by the custom findFlag/firstPositional parser).

db-11 — Broader Ideas

Where this disk-backed pager fits in the rest of the track, and which real-world techniques live one or two steps beyond it.

Immediate next labs

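db-12 — SQL frontend. Defines the statements that will eventually drive the pager, although db-12 itself is a self-contained parser with no execution engine and no storage dependency. Once execution arrives, a row in a table becomes some bytes inside some page, and an INSERT ends in a Pager::write. The B+-tree layer that maps rows to pages is built in db-15 but its scaffolding starts here.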
db-13 — Transactions and MVCC. Each transaction sees a consistent snapshot of the pager. The simplest implementation is copy-on-write at the page level: a write conceptually allocates a new page rather than mutating the old, and snapshots hold roots pointing at the version they read. Our pager's monotonic allocate() is the right primitive for this.
db-14 — Indexes. Secondary indexes are additional B+-trees living in the same pager file as the primary. Multiple trees, one pager, one buffer pool.
db-15 — SQLite-complete. Stitches db-10..db-14 together. Will add page checksums, the rollback journal or WAL, and the free-list page so that deleted pages don't leak.
db-21 — Storage engine advanced. Revisits this pager with CLOCK / 2Q eviction, a freelist, an mmap variant, and possibly a group-commit fsync scheduler.
db-22 — Performance and benchmarking. Measures hit ratio, eviction rate, and fsync cost under realistic workloads; compares LRU against alternative policies.

How this lab's pieces show up in real systems

Small power-of-two pages are the norm in production engines, though the default varies: SQLite and RocksDB SST blocks default to 4 KiB, Postgres to 8 KiB, and InnoDB to 16 KiB. 4 KiB matches both the typical filesystem block size and the Linux page-cache granule on x86-64, so an aligned 4 KiB page never straddles two filesystem blocks or two page-cache pages.
The header-on-the-first-page trick is widespread: SQLite, BoltDB, InnoDB, and Berkeley DB all keep file-level metadata at the very start of the file (SQLite numbers that page 1 rather than 0).
Write-back caching with recency-based eviction is the classic buffer-pool design. Postgres sizes its pool with shared_buffers, InnoDB with innodb_buffer_pool_size, and SQLite calls its equivalent the page cache. Production engines refine the eviction policy, but our implementation is the textbook version they build on.
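fsync-only-on-flush is the contract every transactional engine demands of its pager: the WAL or rollback journal layer above decides when, the pager just provides the primitive. This is what makes the "no-force" policy from the DBMS literature possible: changed pages need not be forced into the data file at commit time, because the log is what makes the commit durable.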
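The doubly-linked-list + hashmap LRU is a common production pattern: InnoDB's buf_LRU list and RocksDB's LRUCache are both variations on it. Not every engine uses it, though: Postgres pairs a hash-table buffer lookup with clock-sweep replacement, and CPU caches rely on hardware pseudo-LRU approximations rather than linked lists. The textbook is real, but it is a starting point.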

Variants worth implementing later

CLOCK replacement — a single circular array with a reference bit per frame. Approximates LRU at lower overhead because there's no list splice per access. PostgreSQL uses this.
2Q — two LRU lists, one for "seen once" and one for "seen twice or more". Resists scan-induced cache pollution. Cheap to implement on top of the existing LRU code.
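ARC (Adaptive Replacement Cache) — an adaptive policy from IBM Research that, like 2Q, keeps separate recency and frequency lists, but tunes the split between them at run time using ghost lists of recently evicted pages. Patented but reimplementable.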
Copy-on-write pages (LMDB-style) — every write allocates a fresh page; old versions stay live for concurrent readers. Trades higher write amplification for free MVCC.
mmap-backed pager — mmap the whole file, let the OS manage the page cache. Drastically simpler code; loses control over eviction and durability.
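
As a taste of the first variant, here is a minimal CLOCK sketch over page ids. It is an illustration under stated assumptions, not a drop-in replacement for the lab's LRU: the Clock type is hypothetical, it tracks no page data or dirty bit, capacity must be at least 1, and hits are found by a linear scan where a real pool would use a hash map.

```rust
/// Minimal CLOCK replacement: one slot per frame, one reference bit per slot.
struct Clock {
    slots: Vec<Option<(u32, bool)>>, // (pid, reference bit)
    hand: usize,
}

impl Clock {
    fn new(capacity: usize) -> Self {
        Clock { slots: vec![None; capacity], hand: 0 }
    }

    /// Touch `pid`. Returns the pid that was evicted to make room, if any.
    fn access(&mut self, pid: u32) -> Option<u32> {
        // Hit: set the reference bit. No list splice is needed.
        for slot in self.slots.iter_mut() {
            if let Some((p, referenced)) = slot {
                if *p == pid {
                    *referenced = true;
                    return None;
                }
            }
        }
        // Miss: sweep the hand, clearing reference bits (the "second chance"),
        // until an empty slot or an unreferenced victim is found.
        loop {
            let i = self.hand;
            self.hand = (self.hand + 1) % self.slots.len();
            let current = self.slots[i];
            match current {
                None => {
                    self.slots[i] = Some((pid, true));
                    return None;
                }
                Some((p, true)) => self.slots[i] = Some((p, false)),
                Some((victim, false)) => {
                    self.slots[i] = Some((pid, true));
                    return Some(victim);
                }
            }
        }
    }
}

fn main() {
    let mut clock = Clock::new(3);
    for pid in [1, 2, 3, 4, 2, 5] {
        println!("access {} -> evicted {:?}", pid, clock.access(pid));
    }
}
```

In the access sequence above, page 2 is touched again before the hand reaches it, so the sweep clears its bit and evicts a neighbour instead. That is the sense in which CLOCK approximates LRU.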

Performance experiments worth running later

Plot hit ratio vs cache_capacity / num_pages for each scenario. Expect a knee around 25..50% for the mixed scenario.
Measure the cost of flush() as a function of dirty-page count. On spinning disk, writes sorted by pid should be noticeably cheaper than unsorted writes, because ascending offsets reduce seeking.
Compare write-back vs write-through latency for a steady stream of small updates. The write-back win should be order-of-magnitude on any device with non-trivial fsync cost.
Vary page_size from 256 B to 64 KiB. For a fixed cache memory budget, smaller pages should improve the hit ratio (finer caching granule), but the per-operation bookkeeping cost grows.

What "production-quality" would require beyond this lab

Crash recovery. Right now, a crash in the middle of a flush leaves a half-written page on disk and no way to detect it. SQLite uses a rollback journal; Postgres uses WAL + a checkpointer. db-13 will introduce the simplest form of this.
Checksums. A CRC32 footer per page so torn writes are detectable, not silently returned to the caller.
A free-list page so deleted pages can be reused, otherwise files grow monotonically.
Concurrent access. Reader-writer latching at the page level so the pager scales to multiple threads.
Direct I/O / O_DIRECT to bypass the OS page cache and prevent double-buffering. Needed at high throughput; subtle to get right.
Async I/O. io_uring on Linux, IOCP on Windows. The synchronous pread/pwrite we use is fine for teaching and for any workload where the database is not the bottleneck.

db-11 step 01 — Page I/O and file layout

Goal

Build the bottom half of the pager: the file format and the uncached read / write / allocate path. No cache, no LRU, no eviction. Every read is a pread; every write is a pwrite; flush is just fsync.

Tasks

Define MAGIC = b"DSE-PAGER-v1\0\0\0\0" (16 bytes) and HEADER_LEN = 24.
Implement Pager::open(path, page_size, capacity):
- If file does not exist or is empty, create it; write a fresh header page (magic + page_size + num_pages=1, zero-padded to page_size); fsync.
- If file exists, read bytes 0..24, validate magic, parse page_size and num_pages. The caller-supplied page_size argument must match the on-disk value (or be supplied as the authoritative size on creation).
Implement Pager::allocate() -> u32:
- return num_pages, then num_pages += 1. The on-disk file is not yet extended — the next flush() will rewrite page 0 and the new page will materialise then.
Implement Pager::read(pid) -> Vec<u8> (no caching yet):
- validate 1 <= pid < num_pages.
- pread(page_size bytes at offset pid * page_size).
Implement Pager::write(pid, bytes) (no caching yet):
- validate bytes.len() == page_size.
- validate 1 <= pid < num_pages.
- pwrite(bytes at offset pid * page_size).
Implement Pager::flush():
- rewrite page 0 with current num_pages (handles allocate-only transactions).
- fsync.
Implement Pager::close():
- flush() then drop the file handle.
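
The tasks above can be sketched in Rust roughly as follows. This is a minimal illustration, not the lab's reference solution: it is Unix-only (it uses FileExt for pread/pwrite), it drops the capacity argument because step 01 has no cache, it assumes page_size >= 24, and it uses io::Error for every failure.

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::os::unix::fs::FileExt; // read_exact_at / write_all_at wrap pread / pwrite

const MAGIC: &[u8; 16] = b"DSE-PAGER-v1\0\0\0\0";

struct Pager { file: File, page_size: u32, num_pages: u32 }

impl Pager {
    fn open(path: &str, page_size: u32) -> io::Result<Pager> {
        let file = OpenOptions::new().read(true).write(true).create(true).open(path)?;
        let mut p = Pager { file, page_size, num_pages: 1 };
        if p.file.metadata()?.len() == 0 {
            p.flush()?; // fresh file: write the header page and fsync
        } else {
            let mut h = [0u8; 24];
            p.file.read_exact_at(&mut h, 0)?;
            let on_disk = u32::from_le_bytes([h[16], h[17], h[18], h[19]]);
            if h[..16] != MAGIC[..] || on_disk != page_size {
                return Err(io::Error::new(io::ErrorKind::InvalidData, "bad header"));
            }
            p.num_pages = u32::from_le_bytes([h[20], h[21], h[22], h[23]]);
        }
        Ok(p)
    }

    fn allocate(&mut self) -> u32 {
        let pid = self.num_pages;
        self.num_pages += 1;
        pid
    }

    /// Validate 1 <= pid < num_pages and return the page's byte offset.
    fn offset(&self, pid: u32) -> io::Result<u64> {
        if pid == 0 || pid >= self.num_pages {
            return Err(io::Error::new(io::ErrorKind::InvalidInput, "pid out of range"));
        }
        Ok(pid as u64 * self.page_size as u64)
    }

    fn read(&self, pid: u32) -> io::Result<Vec<u8>> {
        let off = self.offset(pid)?;
        let mut buf = vec![0u8; self.page_size as usize];
        self.file.read_exact_at(&mut buf, off)?;
        Ok(buf)
    }

    fn write(&self, pid: u32, bytes: &[u8]) -> io::Result<()> {
        if bytes.len() != self.page_size as usize {
            return Err(io::Error::new(io::ErrorKind::InvalidInput, "wrong page length"));
        }
        let off = self.offset(pid)?;
        self.file.write_all_at(bytes, off)
    }

    /// Rewrite page 0 with the current num_pages, then fsync.
    fn flush(&mut self) -> io::Result<()> {
        let mut page0 = vec![0u8; self.page_size as usize];
        page0[..16].copy_from_slice(MAGIC);
        page0[16..20].copy_from_slice(&self.page_size.to_le_bytes());
        page0[20..24].copy_from_slice(&self.num_pages.to_le_bytes());
        self.file.write_all_at(&page0, 0)?;
        self.file.sync_all()
    }
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("pager-step01-demo.bin");
    let path = path.to_str().unwrap();
    let _ = std::fs::remove_file(path);
    let mut p = Pager::open(path, 256)?;
    let pid = p.allocate();
    p.write(pid, &[0xAB; 256])?;
    p.flush()?;
    println!("page {} starts with {:#04x}", pid, p.read(pid)?[0]);
    Ok(())
}
```

One consequence of this sketch worth noticing: reading a page that was allocated but never written fails with an unexpected-EOF error, because nothing has extended the file yet.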

Acceptance

Inline unit tests:

header_round_trip — open new file, close, reopen, assert num_pages == 1 and the magic is intact.
allocate_monotonic — three allocate() calls in a row return 1, 2, 3.
write_then_read_same_pager — allocate, write a known byte pattern, read it back, assert equal.
write_then_reopen_then_read — allocate, write, flush(), drop, reopen, read; bytes survived.
flush_extends_file — after allocate + write + flush, file size equals (num_pages) * page_size.

All five tests green in Rust, Go, and C++.

Discussion prompts

Why is num_pages stored on page 0 rather than inferred from the file size? (Hint: what happens between allocate() and flush() if the OS crashes?)
What goes wrong if open() is called concurrently from two processes on the same file?
Why does flush() rewrite page 0 even if no data page changed?

db-11 step 02 — LRU cache with write-back

Goal

Add the in-memory page cache on top of step 01. Bounded capacity, LRU eviction, write-back on eviction, dirty bit per frame. After this step the disk is touched only on cache misses, evictions, and flush().

Tasks

Define Frame { pid: u32, data: Vec<u8>, dirty: bool, prev, next } where prev/next are indices into a Vec<Frame> (Rust) or *list.Element (Go) or std::list<Frame>::iterator (C++).
Add to Pager:
- capacity: usize — set at open().
- frames: Vec<Frame> — the storage backing the LRU chain.
- free: Vec<usize> — reusable indices after eviction.
- map: HashMap<u32, usize> — pid → frame index.
- head, tail: Option<usize> — MRU and LRU ends.
- hits, misses: u64 — accounting.
Helpers:
- promote(idx) — unlink frame from current position, insert at head. No-op if already at head.
- unlink(idx) — remove frame from the list.
- evict_tail() — pop the LRU frame; if dirty, pwrite before reusing the slot.
- admit(pid, data, dirty) — if at capacity, evict_tail first; allocate a frame (from free or push new); insert at head; update map.
Rewrite read(pid):
- if map[pid] exists: promote, hits += 1, clone, return.
- else: misses += 1, pread, admit(pid, data, dirty=false), clone, return.
Rewrite write(pid, bytes):
- if map[pid] exists: overwrite data, set dirty = true, promote.
- else: admit(pid, bytes, dirty=true) — no pread.
Rewrite flush():
- collect all (pid, frame_idx) where dirty == true.
- sort by pid ascending.
- for each, pwrite at pid * page_size, set dirty = false.
- rewrite page 0 with current num_pages.
- fsync.
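
The list bookkeeping is the part most likely to go wrong, so here is a sketch of the indexed doubly-linked list on its own, with no disk I/O. It is a simplified illustration rather than the lab's code: the type is called Lru instead of Pager, capacity is assumed to be at least 1, and evict_tail hands a dirty victim back to the caller (who would pwrite it) instead of writing it itself. Note that eviction also removes the victim's pid from the map.

```rust
use std::collections::HashMap;

struct Frame { pid: u32, data: Vec<u8>, dirty: bool, prev: Option<usize>, next: Option<usize> }

struct Lru {
    capacity: usize,
    frames: Vec<Frame>,
    free: Vec<usize>,
    map: HashMap<u32, usize>,
    head: Option<usize>, // MRU end
    tail: Option<usize>, // LRU end
}

impl Lru {
    fn new(capacity: usize) -> Self {
        Lru { capacity, frames: Vec::new(), free: Vec::new(), map: HashMap::new(), head: None, tail: None }
    }

    fn unlink(&mut self, idx: usize) {
        let (prev, next) = (self.frames[idx].prev, self.frames[idx].next);
        match prev { Some(p) => self.frames[p].next = next, None => self.head = next }
        match next { Some(n) => self.frames[n].prev = prev, None => self.tail = prev }
        self.frames[idx].prev = None;
        self.frames[idx].next = None;
    }

    fn push_head(&mut self, idx: usize) {
        self.frames[idx].prev = None;
        self.frames[idx].next = self.head;
        if let Some(h) = self.head { self.frames[h].prev = Some(idx); }
        self.head = Some(idx);
        if self.tail.is_none() { self.tail = Some(idx); }
    }

    fn promote(&mut self, idx: usize) {
        if self.head != Some(idx) {
            self.unlink(idx);
            self.push_head(idx);
        }
    }

    /// Pop the LRU frame. Returns (pid, data) only if it was dirty,
    /// i.e. only if the caller must write it back before the slot is reused.
    fn evict_tail(&mut self) -> Option<(u32, Vec<u8>)> {
        let idx = self.tail?;
        self.unlink(idx);
        self.map.remove(&self.frames[idx].pid);
        self.free.push(idx);
        let f = &mut self.frames[idx];
        if f.dirty { Some((f.pid, std::mem::take(&mut f.data))) } else { None }
    }

    /// Insert a page at the MRU end, evicting first if the cache is full.
    fn admit(&mut self, pid: u32, data: Vec<u8>, dirty: bool) -> Option<(u32, Vec<u8>)> {
        let victim = if self.map.len() >= self.capacity { self.evict_tail() } else { None };
        let frame = Frame { pid, data, dirty, prev: None, next: None };
        let idx = match self.free.pop() {
            Some(i) => { self.frames[i] = frame; i }
            None => { self.frames.push(frame); self.frames.len() - 1 }
        };
        self.push_head(idx);
        self.map.insert(pid, idx);
        victim
    }
}

fn main() {
    let mut lru = Lru::new(2);
    lru.admit(1, vec![1], true);
    lru.admit(2, vec![2], false);
    let idx = lru.map[&1];
    lru.promote(idx); // page 1 becomes MRU, page 2 is now the tail
    println!("{:?}", lru.admit(3, vec![3], true)); // evicts clean page 2
    println!("{:?}", lru.admit(4, vec![4], false)); // evicts dirty page 1
}
```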

Acceptance

Inline unit tests:

cache_hit_does_not_pread — write then read twice; second read produces a cache hit (cache_hits >= 1).
eviction_writes_back_dirty — fill cache + 1, evict the oldest frame, drop the pager, reopen, read the evicted pid, bytes match the value written before eviction.
eviction_skips_clean_pages — fill cache with only-reads, evict, reopen: file size unchanged (no spurious writes).
flush_is_idempotent — flush twice in a row, file bytes identical, both succeed.
hits_misses_accounting — for a known sequence of operations, cache_hits + cache_misses equals the number of read calls (writes that hit the cache are not counted as reads).

All five tests green in Rust, Go, and C++.

Discussion prompts

Why does write on a miss not do a pread? When could this be wrong? (Answer: never, as long as the caller writes the whole page. Partial-page writes would need read-modify-write.)
Why sort dirty pages by pid before writing them out?
What is the worst-case eviction cost, and how could evict_tail amortize fsyncs across many evictions?

db-11 step 03 — Cross-language byte agreement

Goal

Pin the file format. After this step a workload run in Rust, Go, or C++ produces sha256-identical files for the same inputs. This is what makes the pager a real cross-language contract, not three loosely-related implementations.

Tasks

Implement SplitMix64 exactly:

next(state):
    state += 0x9E3779B97F4A7C15            // wrapping
    z = state
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E7B5
    z = (z ^ (z >> 27)) * 0x94D049BB133111EB
    return z ^ (z >> 31)

All multiplies are wrapping u64. Test against a known first-output table (seed = 0 yields 0xE220A8397B1DCDAF etc.).

Implement run_workload(path, page_size, capacity, pages, ops, seed, scenario):

pager = Pager::open(path, page_size, capacity)
while pager.num_pages() < pages + 1:
    pager.allocate()
rng = SplitMix64(seed)
for iteration in 0..ops:
    r = rng.next()
    op       = (r >> 62) & 0b11           // 0,1,2,3
    byte_val = (r >> 24) & 0xFF
    pid = match scenario:
        sequential -> 1 + (iteration % pages)
        random     -> 1 + (r as u64 % pages)
        mixed      -> if (r >> 60) & 1 then random_pid else sequential_pid
    match op:
        0 | 1 -> write a page of [byte_val; page_size]
        2     -> read pid (discard result)
        3     -> skip
pager.flush()
return pager

Critical: each iteration consumes exactly one next() call. This is what keeps the three scenarios comparable for a given seed.
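
The bit extraction is where ports most easily drift apart, so it helps to isolate it as a pure function. The sketch below decodes one PRNG output into an operation; the Scenario and Op enums and the decode name are hypothetical, while the shifts and masks follow the pseudocode above.

```rust
#[derive(Clone, Copy)]
enum Scenario { Sequential, Random, Mixed }

#[derive(Debug, PartialEq)]
enum Op {
    Write { pid: u32, fill: u8 },
    Read { pid: u32 },
    Skip,
}

/// Decode one PRNG output `r` into the operation for this iteration.
/// Assumes pages > 0. Exactly one `r` is consumed per iteration, even for Skip.
fn decode(r: u64, iteration: u64, pages: u64, scenario: Scenario) -> Op {
    let sequential_pid = 1 + (iteration % pages);
    let random_pid = 1 + (r % pages);
    let pid64 = match scenario {
        Scenario::Sequential => sequential_pid,
        Scenario::Random => random_pid,
        // Bit 60 picks between the two pid rules.
        Scenario::Mixed => if (r >> 60) & 1 == 1 { random_pid } else { sequential_pid },
    };
    let pid = pid64 as u32;
    let fill = ((r >> 24) & 0xFF) as u8; // byte written to every offset of the page
    match (r >> 62) & 0b11 {
        0 | 1 => Op::Write { pid, fill },
        2 => Op::Read { pid },
        _ => Op::Skip,
    }
}

fn main() {
    // Top two bits 00 -> write; bits 24..31 hold the fill byte 0xAB.
    println!("{:?}", decode(0x0000_0000_AB00_0005, 33, 32, Scenario::Random));
    // Top two bits 11 -> skip.
    println!("{:?}", decode(0xE220_A839_7B1D_CDAF, 0, 32, Scenario::Sequential));
}
```

Keeping the decode step free of I/O means each port can unit-test it against a handful of hand-computed values before the pager is involved at all.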

Build a pagerctl CLI in each language with subcommands init and workload. workload runs the function above and prints sha256_file(path) in lowercase hex with no trailing newline to stdout. The CLI must accept <path> either before or after the --flags — the cross-test passes path first; some contributors will pass it last.
Write scripts/cross_test.sh:
- build all three binaries (cargo release, go build, cmake+make).
- for scenarios A (sequential), B (random), C (mixed): run each language, sha256 the resulting file, assert all three match each other and match the baked-in expected hash.
- spot-check the first 20 bytes of one file equal the expected header bytes.
Bake the canonical hashes into the Go and C++ test suites too, so a divergence is caught at go test / ctest time even without running the shell script.

Acceptance

scripts/verify.sh exits 0; each language reports its tests green.
scripts/cross_test.sh exits 0 with === ALL OK ===.
The canonical hashes table in docs/verification.md matches the hashes hard-coded in:
- scripts/cross_test.sh
- src/go/pager_test.go::TestWorkloadMatchesCanonicalHashes
- src/cpp/tests/test_pager11.cc (canonical hashes block)

Discussion prompts

What happens to the sha256 of Scenario A if you swap the order of the two multiplies in SplitMix64?
Why does the workload draw exactly one next() per iteration, even for the skip case? (See docs/analysis.md.)
If we wanted to add a fourth scenario (e.g. "read-heavy"), what would have to change in this lab to keep the cross-test working?

db-12 — SQL Frontend

What is it?

A self-contained SQL frontend: a tokenizer + recursive-descent parser that turns a small but realistic SQL dialect into an Abstract Syntax Tree, plus a canonical byte serializer that hashes deterministically. There is no execution engine in this lab — the AST stops at bytes-on-disk and bytes-on-the-wire.

The supported dialect is the smallest one that's still interesting:

CREATE TABLE name (col TYPE, …); with INT and TEXT columns.
INSERT INTO name VALUES (…), (…), …; with single-row and multi-row form.
SELECT * | col, col, … FROM name [WHERE col OP literal];
DELETE FROM name [WHERE col OP literal];
UPDATE name SET col = lit, col = lit, … [WHERE col OP literal];

with six comparison operators (=, !=, <, <=, >, >=), integer and text literals ('pad''let' style escape), -- line comments, and case-insensitive keywords. Identifiers are preserved verbatim.

Execution — the bytecode VM that walks the AST to actually run the statement — is deliberately deferred to db-13 (where it can share the transaction machinery it really needs). This lab stops at "the program parsed to this exact AST, and we can prove it byte-for-byte across three languages".

Why does it matter?

This is the lab where the project pivots from storage to language. Every SQL database in the world starts with the same three-stage front:

source text ──► tokens ──► AST ──► (planner / VM / executor)

What we are doing here is exactly steps one and two, plus a fourth step — serialize the AST to a canonical byte stream — that no production engine needs but the project needs as the only honest cross-language proof that three independent parsers agree on the meaning of the same SQL text.

If you've ever read SQLite's tokenize.c or Postgres's scan.l / gram.y, this lab is the same shape, written by hand:

Tokenizing is a single character-by-character pass with a handful of state branches (whitespace, comment, identifier, number, string, operator, single-char punct).
Parsing is recursive descent: one function per non-terminal, peek at the next token, dispatch, recurse. No parser generators, no table-driven state machines, no lookahead arithmetic — the grammar is tiny enough that the code is almost a 1:1 transliteration of the BNF.
The AST is a discriminated union (Rust enum, Go field-bag struct, C++ struct with a kind tag). Statements know their kind; literals know their type.
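
To make the tokenizer bullet concrete, here is a sketch of the character loop. The Tok and Token types are hypothetical stand-ins for the lab's own, keywords are left as plain identifiers for the parser to recognise, and two simplifications are assumed: string literals do not span lines, and a lone - is emitted as punctuation so the parser can treat it as a sign.

```rust
#[derive(Debug, PartialEq)]
enum Tok { Ident(String), Int(i64), Text(String), Op(String), Punct(char) }

#[derive(Debug, PartialEq)]
struct Token { tok: Tok, line: u32, col: u32 }

fn perr(line: u32, col: u32, msg: &str) -> String {
    format!("parse error at line {} col {}: {}", line, col, msg)
}

fn tokenize(src: &str) -> Result<Vec<Token>, String> {
    let chars: Vec<char> = src.chars().collect();
    let (mut i, mut line, mut col) = (0usize, 1u32, 1u32);
    let mut out = Vec::new();
    while i < chars.len() {
        let (c, start) = (chars[i], i);
        if c == '\n' { i += 1; line += 1; col = 1; continue; }
        if c.is_whitespace() { i += 1; col += 1; continue; }
        if c == '-' && chars.get(i + 1) == Some(&'-') {
            // Line comment: skip to the newline, which the next iteration consumes.
            while i < chars.len() && chars[i] != '\n' { i += 1; }
            continue;
        }
        let tok = if c.is_ascii_alphabetic() || c == '_' {
            while i < chars.len() && (chars[i].is_ascii_alphanumeric() || chars[i] == '_') { i += 1; }
            Tok::Ident(chars[start..i].iter().collect())
        } else if c.is_ascii_digit() {
            while i < chars.len() && chars[i].is_ascii_digit() { i += 1; }
            let digits: String = chars[start..i].iter().collect();
            Tok::Int(digits.parse::<i64>().map_err(|_| perr(line, col, "integer out of range"))?)
        } else if c == '\'' {
            i += 1;
            let mut text = String::new();
            loop {
                match chars.get(i) {
                    None => return Err(perr(line, col, "unterminated string")),
                    // A doubled quote is an escaped quote inside the literal.
                    Some('\'') if chars.get(i + 1) == Some(&'\'') => { text.push('\''); i += 2; }
                    Some('\'') => { i += 1; break; }
                    Some(&ch) => { text.push(ch); i += 1; }
                }
            }
            Tok::Text(text)
        } else if "=!<>".contains(c) {
            i += 1;
            if c != '=' && chars.get(i) == Some(&'=') { i += 1; }
            if c == '!' && i - start == 1 { return Err(perr(line, col, "expected '=' after '!'")); }
            Tok::Op(chars[start..i].iter().collect())
        } else if "(),;*-".contains(c) {
            i += 1;
            Tok::Punct(c)
        } else {
            return Err(perr(line, col, "unexpected character"));
        };
        out.push(Token { tok, line, col }); // position of the token's first character
        col += (i - start) as u32;
    }
    Ok(out)
}

fn main() {
    for t in tokenize("SELECT a FROM t WHERE b >= 'it''s'; -- trailing comment\n").unwrap() {
        println!("{:?}", t);
    }
}
```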

Once you've built one frontend by hand, every other one becomes a reading exercise.

How does it work?

            ┌──────── source text (UTF-8 bytes) ────────┐
            │                                            │
   tokenize │  char loop: ws | -- comment | ident |     │
   ─────────►│  number | string | op | punct             │
            │  → Vec<Token { kind, payload, line, col }>│
            │                                            │
   parse    │  recursive descent:                        │
   ─────────►│    parse_program        = stmt* EOF       │
            │    parse_stmt            dispatches on kw  │
            │    parse_create/insert/select/delete/update│
            │  → Vec<Statement>                          │
            │                                            │
   serialize│  walk AST, emit canonical bytes (see       │
   ─────────►│  "wire format" below). Magic header lets  │
            │  decoders sanity-check.                    │
            │  → Vec<u8>                                 │
            │                                            │
   sha256   │  inline FIPS 180-4 (Rust + C++);           │
   ─────────►│  crypto/sha256 (Go). Output hex matches   │
            │  in all three languages on any input.      │
            └────────────────────────────────────────────┘
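
The parse stage of the diagram can be sketched for a single statement form, DELETE FROM name [WHERE col OP literal];. This is an illustration of the one-function-per-non-terminal shape, not the lab's parser: the types are hypothetical, tokens carry no positions (so the error strings omit line and column), and keywords are not reserved, so an identifier named like a keyword is accepted.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Tok { Ident(String), Int(i64), Text(String), Op(String), Punct(char) }

#[derive(Debug, PartialEq)]
enum Literal { Int(i64), Text(String) }

#[derive(Debug, PartialEq)]
struct Where { col: String, op: String, lit: Literal }

#[derive(Debug, PartialEq)]
struct Delete { table: String, filter: Option<Where> }

struct Parser { toks: Vec<Tok>, pos: usize }

impl Parser {
    /// Consume and return the next token, if any.
    fn next(&mut self) -> Option<Tok> {
        let t = self.toks.get(self.pos).cloned();
        self.pos += 1;
        t
    }

    /// Peek: is the next token the given keyword (case-insensitive)?
    fn is_keyword(&self, kw: &str) -> bool {
        matches!(self.toks.get(self.pos), Some(Tok::Ident(s)) if s.eq_ignore_ascii_case(kw))
    }

    fn expect_keyword(&mut self, kw: &str) -> Result<(), String> {
        if self.is_keyword(kw) { self.pos += 1; Ok(()) } else { Err(format!("expected {}", kw)) }
    }

    fn ident(&mut self) -> Result<String, String> {
        match self.next() {
            Some(Tok::Ident(s)) => Ok(s),
            other => Err(format!("expected identifier, got {:?}", other)),
        }
    }

    fn literal(&mut self) -> Result<Literal, String> {
        match self.next() {
            Some(Tok::Int(n)) => Ok(Literal::Int(n)),
            Some(Tok::Text(s)) => Ok(Literal::Text(s)),
            other => Err(format!("expected literal, got {:?}", other)),
        }
    }

    /// where := [ WHERE ident op literal ]
    fn parse_where(&mut self) -> Result<Option<Where>, String> {
        if !self.is_keyword("WHERE") { return Ok(None); }
        self.pos += 1;
        let col = self.ident()?;
        let op = match self.next() {
            Some(Tok::Op(o)) => o,
            other => return Err(format!("expected operator, got {:?}", other)),
        };
        let lit = self.literal()?;
        Ok(Some(Where { col, op, lit }))
    }

    /// delete := DELETE FROM ident where ';'
    fn parse_delete(&mut self) -> Result<Delete, String> {
        self.expect_keyword("DELETE")?;
        self.expect_keyword("FROM")?;
        let table = self.ident()?;
        let filter = self.parse_where()?;
        match self.next() {
            Some(Tok::Punct(';')) => Ok(Delete { table, filter }),
            other => Err(format!("expected ';', got {:?}", other)),
        }
    }
}

fn main() {
    let toks = vec![
        Tok::Ident("delete".into()), Tok::Ident("from".into()), Tok::Ident("users".into()),
        Tok::Ident("WHERE".into()), Tok::Ident("name".into()), Tok::Op("=".into()),
        Tok::Text("ada".into()), Tok::Punct(';'),
    ];
    let mut parser = Parser { toks, pos: 0 };
    println!("{:?}", parser.parse_delete());
}
```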

Wire format

Magic header "DSESQL01" (8 ASCII bytes), then u32 LE statement count, then that many statement records:

Statement record:
  u8 kind           1=Create, 2=Insert, 3=Select, 4=Delete, 5=Update
  u32 LE name_len
  name bytes

  if Create:
    u32 LE col_count
    repeat col_count: { u32 LE name_len, name bytes, u8 type (1=Int|2=Text) }

  if Insert:
    u32 LE row_count
    repeat row_count:
      u32 LE col_count
      repeat col_count: literal

  if Select:
    u8 cols_kind     1 = *, 0 = named
    if named: u32 LE n, repeat n: { u32 LE name_len, name bytes }
    where

  if Delete:
    where

  if Update:
    u32 LE set_count
    repeat set_count: { u32 LE name_len, name bytes, literal }
    where

literal:
  u8 tag            1 = Int, 2 = Text
  if Int:  i64 LE (two's-complement, little-endian)
  if Text: u32 LE n, n bytes

where:
  u8 has_where      0 = no WHERE, 1 = WHERE
  if 1:
    u32 LE col_name_len, col_name bytes
    u8 op           1=Eq, 2=Ne, 3=Lt, 4=Le, 5=Gt, 6=Ge
    literal

Every integer is unsigned-little-endian unless noted. Strings carry their own length prefix (no null-terminators, no escapes — the bytes are exactly what the parser saw between the unescaped quotes).
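
A sketch of the two smallest pieces of the serializer, the literal record and a one-statement program, may help when checking a port byte by byte. The function names are hypothetical and only the DELETE-without-WHERE case is encoded; the layout follows the format described above.

```rust
enum Literal { Int(i64), Text(String) }

/// Length-prefixed bytes: u32 LE length, then the raw bytes.
fn put_bytes(out: &mut Vec<u8>, bytes: &[u8]) {
    out.extend_from_slice(&(bytes.len() as u32).to_le_bytes());
    out.extend_from_slice(bytes);
}

/// literal := tag (1 = Int, 2 = Text) followed by the payload.
fn put_literal(out: &mut Vec<u8>, lit: &Literal) {
    match lit {
        Literal::Int(n) => {
            out.push(1);
            out.extend_from_slice(&n.to_le_bytes()); // i64, two's complement, LE
        }
        Literal::Text(s) => {
            out.push(2);
            put_bytes(out, s.as_bytes());
        }
    }
}

/// Encode the program `DELETE FROM <table>;` (no WHERE clause).
fn encode_delete_all(table: &str) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(b"DSESQL01"); // magic header
    out.extend_from_slice(&1u32.to_le_bytes()); // statement count
    out.push(4); // kind = Delete
    put_bytes(&mut out, table.as_bytes()); // table name
    out.push(0); // has_where = 0
    out
}

fn main() {
    let mut buf = Vec::new();
    put_literal(&mut buf, &Literal::Int(-1));
    put_literal(&mut buf, &Literal::Text("hi".to_string()));
    println!("{:02x?}", buf);
    println!("{:02x?}", encode_delete_all("t"));
}
```

Writing every multi-byte field through to_le_bytes keeps the output independent of the host's endianness, which is one of the ways the three ports could otherwise disagree.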

Error format

Every error message is one line of the form:

parse error at line L col C: <message>

Lines and columns are 1-based. tokenize errors report the position of the bad character; parse errors report the position of the offending token (or one past the last token's column if the input ended early).
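
The position arithmetic is easy to get off by one, so here is a sketch of it in isolation. Both function names are hypothetical, and the sketch assumes columns count characters rather than bytes and that offset falls on a character boundary; the source does not state either choice, so all three ports would need to agree on it.

```rust
/// Convert a byte offset into a 1-based (line, column) pair.
/// Assumption: columns count characters, and `offset` is a char boundary.
fn position_of(src: &str, offset: usize) -> (u32, u32) {
    let (mut line, mut col) = (1u32, 1u32);
    for ch in src[..offset].chars() {
        if ch == '\n' {
            line += 1;
            col = 1;
        } else {
            col += 1;
        }
    }
    (line, col)
}

/// Render the single-line error format.
fn parse_error(line: u32, col: u32, message: &str) -> String {
    format!("parse error at line {} col {}: {}", line, col, message)
}

fn main() {
    let src = "SELECT FROM t;";
    // The offending token is FROM, which starts at byte offset 7.
    let (line, col) = position_of(src, 7);
    println!("{}", parse_error(line, col, "expected column list"));
}
```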

What's intentionally out of scope

Execution. No VM, no query plan, no I/O. db-13.
JOIN, GROUP BY, ORDER BY, LIMIT, expressions. Single-table predicates only. Future labs.
Schema validation. A SELECT name FROM t referencing an undefined column parses cheerfully; that's the planner's job, not the parser's.
Identifier case folding. SQLite folds Users and users together; Postgres folds them to lowercase. We do neither — identifiers are preserved verbatim, only keywords are case-insensitive. This makes the byte-identity test sharper.
Quoted identifiers ("foo"), backticks, square brackets. One identifier syntax keeps the tokenizer trivial.
Negative literals as expressions. A leading - before an integer literal in a value position is parsed as a sign on that literal; it is not a unary-minus operator. There is no general expression grammar.

Cross-language invariant

All three implementations expose sqlctl parse --file FOO.sql (or --inline "..."). Stdout receives the canonical bytes; stderr receives the sha256 hex (no trailing newline). scripts/cross_test.sh runs both fixtures through all three binaries and asserts:

All three stderr-emitted sha256s match.
The matching hash equals the frozen-in-tests value (so the wire format cannot silently drift even if all three implementations drift together).
The bytes themselves are bit-identical (cmp -s) — guarding against a hypothetical sha256 collision.
The error path also matches: feeding "SELECT FROM t;" to all three binaries must produce a non-zero exit and an error line that mentions the column.

The frozen reference hashes are:

Fixture	Bytes	sha256
`a_basic.sql`	181	`071b40fd5d0c684695c5a8499be6fe970ed4533af16f71dcc4c455091b576d15`
`b_full.sql`	486	`e219f1ee4ae69f194cca7b9791aa2e34ecdb2680956dbf8a94618fa8093aa962`

Any change to the AST shape, tokenizer behavior, or wire format must update those numbers in scripts/cross_test.sh, the Go test (sql_test.go), the C++ test (tests/test_sqlfront12.cc), and this table — all in the same commit.

db-12 — References

Primary sources

Crafting Interpreters, Robert Nystrom — chapters 4 ("Scanning") and 6 ("Parsing Expressions") map almost 1:1 onto what we built. The hand-rolled recursive-descent style and the "one function per non-terminal" discipline are taken straight from this book. https://craftinginterpreters.com/
Modern Compiler Implementation in ML (or in C / Java), Andrew Appel — chapter 3 ("Parsing"). The clearest exposition of why recursive descent works for LL(1) grammars and what changes when you need lookahead or precedence climbing.
The Dragon Book (Aho, Lam, Sethi, Ullman) — chapters 3 and 4. The textbook source for lexical analysis (DFA construction, regular expressions to scanners) and predictive parsing. Read for theory; the practice is in Crafting Interpreters.

How real databases parse SQL

SQLite — src/tokenize.c (hand-rolled DFA, conceptually the same as ours but written in C with a generated keyword-lookup function) and src/parse.y (a Lemon-parser grammar, which generates a table-driven bottom-up parser rather than our recursive-descent). https://github.com/sqlite/sqlite/blob/master/src/tokenize.c https://www.sqlite.org/lemon.html
PostgreSQL — src/backend/parser/scan.l (flex) and src/backend/parser/gram.y (bison). A generator-based front; the grammar file alone is ~17k lines. Worth opening just to see the scale of the dialect they support. https://github.com/postgres/postgres/tree/master/src/backend/parser
DuckDB — uses libpg_query, which is a stripped-down Postgres parser exposed as a library. A useful pattern when you want Postgres-compatible SQL without the rest of Postgres. https://github.com/duckdb/duckdb/tree/main/third_party/libpg_query
CockroachDB — pkg/sql/parser/sql.y (goyacc). Like DuckDB above, another example of "real database, generated parser".

Recursive descent specifically

Rob Pike, Lexical Scanning in Go, Google Technology User Group Sydney, 2011 — the talk that popularized the "scanner emits tokens to a channel; parser reads from the channel" style. We don't use channels (we just return a Vec), but the state-machine framing of the scanner loop is the same. https://www.youtube.com/watch?v=HxaD_trXwRE
Doug Crockford, Top Down Operator Precedence, 2007 — the cleanest explanation of Pratt parsing, which is what you reach for next once recursive descent runs into expression-precedence pain. We don't need it here (no expression grammar) but it's the natural follow-up. https://crockford.com/javascript/tdop/tdop.html

Determinism / wire formats

Federal Information Processing Standards Publication 180-4, NIST, 2015 — the SHA-256 specification we implemented inline in Rust and C++. https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf
Protobuf encoding docs — useful counterexample for what not to do if you want byte-identity. Protobuf's "unknown field" handling and optional canonicalization are exactly the corners that prevent stable hashes across implementations. Our format avoids those corners on purpose. https://protobuf.dev/programming-guides/encoding/

Cross-lab dependencies

None. db-12 is intentionally self-contained: there is nothing in earlier labs (storage, WAL, LSM, B-tree) that the parser needs, and the AST serializer uses no upstream wire format. The C++ build does not add_subdirectory(../db-NN). That isolation is a feature, not an oversight — it keeps the lab small enough to be reasoned about as a self-contained compiler-front exercise.

db-12 feeds db-13 (execution + transactions), where the AST will finally be walked by a VM.

db-12 — Analysis

We are building a hand-written SQL frontend in three languages and proving agreement byte-for-byte. The hard part is not any single piece — tokenizing, parsing, and emitting bytes are all small, well-understood components — but holding all three implementations to a single set of design decisions tight enough that the output hashes match on every input.

Required invariants

Deterministic encoding. Given the same input text, the serializer must produce exactly the same byte sequence on every run, in every language, on every machine. No iteration over hash-maps, no environment-dependent integer widths, no locale-sensitive case conversion. Iteration order of set / cols is insertion order (which is parse order, which is source order).
Error reporting carries 1-based (line, col). Tokenizer errors point at the offending character; parser errors point at the offending token. The error string format is identical across languages (a cross_test.sh smoke test asserts this on SELECT FROM t;).
Identifiers are preserved verbatim. Only keywords are case-insensitive. select * FROM uSeRs is legal; the table identifier uSeRs is emitted as the bytes u, S, e, R, s (not users, not USERS).
String literals use SQL escape: doubled single-quote = one single-quote. 'pad''let' is the 7-byte string pad'let. No backslash escapes; no E'...' C-style escapes. The serializer emits the unescaped string contents.
All five statement kinds round-trip identically. No statement is "almost canonical": every parsable input serializes to the same bytes in all three languages.
Cross-language byte identity is the only acceptable proof. Equal AST shapes "by inspection" don't count; equal sha256 over the serialized bytes does.
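
The error-format invariant above can be pinned down in a few lines. This is a minimal sketch, not the lab's actual type: the field names `line`, `col`, and `msg` are assumptions, and only the rendered string (`parse error at line L col C: <msg>`) comes from this document.

```rust
use std::fmt;

/// Hypothetical error type; field names are assumptions for this sketch.
#[derive(Debug, PartialEq)]
pub struct ParseError {
    pub line: usize, // 1-based
    pub col: usize,  // 1-based
    pub msg: String,
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Fixed layout with plain integer formatting, so nothing here
        // depends on locale or platform.
        write!(f, "parse error at line {} col {}: {}", self.line, self.col, self.msg)
    }
}

impl std::error::Error for ParseError {}

fn main() {
    let e = ParseError { line: 1, col: 8, msg: "expected identifier".to_string() };
    println!("{e}");
}
```

Keeping the format in one `Display` impl per language means the cross-language comparison only has to diff a single string.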

Design decisions

Why a `u8` tag in front of every variant

The wire format is a tagged union. Every statement, every literal, every WHERE-or-no-WHERE choice starts with a single byte that tells you what follows. The alternatives all fail:

Implicit type from position: requires a schema, which the frontend has no access to.
Self-describing JSON-like format: kills byte identity (key ordering, whitespace, escape choices).
Protobuf-style field encoding (tagged fields plus varints): introduces "unknown field" / "default value" ambiguities, and a varint admits overlong encodings of the same number. Two encoders that agree on the schema can still disagree on the bytes.

A fixed u8 tag with a tight numeric assignment (Create=1, Insert=2, Select=3, Delete=4, Update=5) plus length-prefixed strings gives us bytes that are trivially deterministic.
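
As an illustration of the tag-plus-length-prefix idea, here is a small sketch that encodes one `SELECT * FROM <table>` record. `Select=3` comes from the assignment above; the `Star` tag value of 1 and the trailing `has_where` flag byte follow the field order this document lists for that statement and should be read as assumptions about the real layout, which is defined in CONCEPTS.md. The helper names are hypothetical.

```rust
const KIND_SELECT: u8 = 3; // from the tag assignment above
const COLS_STAR: u8 = 1;   // assumption: tag for "SELECT *"

/// Append a u32 little-endian length followed by the raw string bytes.
fn put_str(out: &mut Vec<u8>, s: &str) {
    out.extend_from_slice(&(s.len() as u32).to_le_bytes());
    out.extend_from_slice(s.as_bytes());
}

/// Encode `SELECT * FROM <table>` with no WHERE clause.
fn encode_select_star(table: &str) -> Vec<u8> {
    let mut out = Vec::new();
    out.push(KIND_SELECT);    // u8 variant tag
    put_str(&mut out, table); // u32 LE length + bytes
    out.push(COLS_STAR);      // u8 tag: Star vs Named
    out.push(0);              // u8 has_where = 0
    out
}

fn main() {
    let bytes = encode_select_star("users");
    println!("{} bytes: {:02x?}", bytes.len(), bytes);
}
```

Every branch point in the record is a byte the encoder writes explicitly, so there is no room for two encoders to pick different representations of the same AST.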

Why `INT` is `i64 LE`, not varint

i64 LE is the simplest thing that works in all three languages without a helper library. Varint would save a few bytes on small literals but costs a non-trivial encoder/decoder that we'd have to keep in lockstep across Rust/Go/C++.
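
For a sense of how little code fixed-width little-endian needs, here is a sketch of an integer-literal encoder. The tag value 1 for `Int` is the one this lab assigns to integer literals; the function name is hypothetical.

```rust
const LIT_INT: u8 = 1; // tag byte for an integer literal

/// Append one integer literal: u8 tag followed by 8 little-endian bytes.
fn put_int_literal(out: &mut Vec<u8>, v: i64) {
    out.push(LIT_INT);
    out.extend_from_slice(&v.to_le_bytes());
}

fn main() {
    let mut out = Vec::new();
    put_int_literal(&mut out, 2);
    put_int_literal(&mut out, -1);
    // Each literal is always 9 bytes, regardless of magnitude or sign.
    println!("{:02x?}", out);
}
```

The standard library does the byte-order work, so the width never depends on the value and there is no per-language encoder logic to keep in lockstep.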

Why operators get a single byte, not a string

Same reason: a fixed numeric assignment (Eq=1..Ge=6) makes the byte layout exact and language-agnostic. If we wrote "=", then someone in some language would eventually decide to emit "==" and the hashes would drift on the day the lab grew expressions.
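
A sketch of what the fixed assignment looks like in Rust. The discriminants are the ones given above; `from_text` and `tag` are hypothetical helper names used only for illustration.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(u8)]
enum Op {
    Eq = 1,
    Ne = 2,
    Lt = 3,
    Le = 4,
    Gt = 5,
    Ge = 6,
}

impl Op {
    /// Map operator source text to an Op; anything else is rejected.
    fn from_text(s: &str) -> Option<Op> {
        match s {
            "=" => Some(Op::Eq),
            "!=" => Some(Op::Ne),
            "<" => Some(Op::Lt),
            "<=" => Some(Op::Le),
            ">" => Some(Op::Gt),
            ">=" => Some(Op::Ge),
            _ => None,
        }
    }

    /// The single byte written to the wire.
    fn tag(self) -> u8 {
        self as u8
    }
}

fn main() {
    for text in ["=", "!=", "<", "<=", ">", ">=", "=="] {
        println!("{:>2} -> {:?}", text, Op::from_text(text).map(Op::tag));
    }
}
```

The spelling of an operator exists only on the input side; once parsed, the only thing that can reach the output is one of six byte values.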

Why we keep the `MAGIC` header

"DSESQL01" is 8 bytes of self-description. It costs nothing, lets a hypothetical decoder detect "this isn't a db-12 AST blob" before mis-parsing, and pins the wire format version inside the bytes themselves (01). If the format ever changes incompatibly, we bump to DSESQL02 and update the frozen hashes.

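The lab ships no decoder, so the following is only a sketch of the check a hypothetical one could make before touching any statement bytes. The function name and error strings are assumptions; the 8-byte magic and the u32 LE statement count that follows it are from this document.

```rust
const MAGIC: &[u8; 8] = b"DSESQL01";

/// Hypothetical decoder-side header check: returns the statement count
/// if the blob starts with the expected magic.
fn read_header(blob: &[u8]) -> Result<u32, String> {
    if blob.len() < 12 {
        return Err("blob shorter than the 12-byte header".to_string());
    }
    if !blob.starts_with(MAGIC) {
        return Err("bad magic: not a db-12 AST blob".to_string());
    }
    Ok(u32::from_le_bytes([blob[8], blob[9], blob[10], blob[11]]))
}

fn main() {
    let mut blob = MAGIC.to_vec();
    blob.extend_from_slice(&4u32.to_le_bytes());
    println!("{:?}", read_header(&blob));
    println!("{:?}", read_header(b"DSESQL02\x04\x00\x00\x00"));
}
```

A version bump changes the magic itself, so an old decoder fails on the first comparison instead of misreading the body.
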
Why we don't compute the AST length up front

A length prefix on each statement would force a two-pass serialize (size then write), or backpatching. We get away without it because the wire format is fully self-describing left-to-right; a decoder needs no random access. Keeping the encoder one-pass keeps all three implementations short and obviously equivalent.
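
To make "no random access" concrete, here is a sketch of a forward-only reader over serialized bytes. `Reader` and its methods are hypothetical (the lab only ships encoders), and the sample record uses the field order this document lists for `SELECT * FROM users`, which is an assumption about the real layout.

```rust
/// Hypothetical forward-only cursor over serialized AST bytes.
struct Reader<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> Reader<'a> {
    /// Consume exactly n bytes, or fail without moving the cursor.
    fn take(&mut self, n: usize) -> Option<&'a [u8]> {
        let buf: &'a [u8] = self.buf;
        let end = self.pos.checked_add(n)?;
        let s = buf.get(self.pos..end)?;
        self.pos = end;
        Some(s)
    }

    fn read_u8(&mut self) -> Option<u8> {
        self.take(1).map(|b| b[0])
    }

    fn read_u32(&mut self) -> Option<u32> {
        let b = self.take(4)?;
        Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
    }

    /// u32 LE length followed by that many UTF-8 bytes.
    fn read_str(&mut self) -> Option<&'a str> {
        let n = self.read_u32()? as usize;
        std::str::from_utf8(self.take(n)?).ok()
    }
}

fn main() {
    // kind=3, "users", cols_kind=1, has_where=0
    let bytes = [3u8, 5, 0, 0, 0, b'u', b's', b'e', b'r', b's', 1, 0];
    let mut r = Reader { buf: &bytes, pos: 0 };
    let kind = r.read_u8();
    let table = r.read_str();
    let cols_kind = r.read_u8();
    let has_where = r.read_u8();
    println!("{:?} {:?} {:?} {:?} (consumed {})", kind, table, cols_kind, has_where, r.pos);
}
```

Each read decides what comes next from bytes already seen, which is why the encoder can stay one-pass: nothing it writes early depends on a size it only learns later.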

Why the C++ build is self-contained

db-12 has no upstream lab dependencies. The C++ CMakeLists.txt does not add_subdirectory(../db-NN). That keeps the lab's ctest output clean (only one test target: test_sqlfront12) and avoids the trap from db-09 where leaking upstream add_test calls polluted local runs. Each lab's CMake should ask itself: do I genuinely need upstream code in this binary? For db-12, the answer is no.

Why deferring execution is the right call

A VM that walks the AST is the natural next step, but it needs a storage backend (db-10/11 pager or db-05/06 LSM), a notion of column types and rows, and ideally a transaction layer. Bolting any of that into db-12 would either bind it to a specific storage shape too early or build a toy in-memory engine we'd throw away in db-13. Stopping at AST bytes keeps the lab small, scope-clean, and shippable.

Why three languages

The same reason as every lab from db-01 onward: the only honest way to prove that two implementations of a binary protocol agree is to compute sha256 of their output and compare. With three implementations all matching the same frozen reference hash, the chance that a bug in just one of them still produces a matching sha256 is vanishingly small. A matching hash on a non-trivial fixture is therefore strong evidence that the tokenize → parse → encode pipelines agree on every construct the fixture exercises. It cannot rule out a mistake all three share (for example, one ported faithfully from the Rust reference); the per-language unit tests are what cover that case.

db-12 — Execution

What we built, in the order we built it.

1. Rust (`src/rust`) — the reference

Cargo.toml declares crate sqlfront12 (lib) and a binary sqlctl. No external dependencies, no path dependencies — the lab is self-contained.
src/lib.rs (~1100 lines) defines:
- ParseError (one error type for both tokenize and parse phases).
- TokKind + tokenize(src) -> Result<Vec<Token>, ParseError>. The tokenizer is a single character-by-character loop with branches for whitespace, -- line comment, identifier/keyword, integer literal, '...' string literal (with '' escape), comparison operator (=, !=, <, <=, >, >=), and single-char punctuation ((, ), ,, ;, *).
- ColType { Int, Text }, Literal { Int(i64), Text(String) }, Op { Eq=1, Ne=2, Lt=3, Le=4, Gt=5, Ge=6 } (#[repr(u8)]), Where, SelectCols { Star, Named(Vec<String>) }, Statement enum with five variants.
- Parser struct (token slice + cursor) with one method per non-terminal (parse_program, parse_stmt, parse_create, parse_insert, parse_select, parse_delete, parse_update, parse_where, parse_literal).
- parse(src) -> Result<Vec<Statement>, ParseError> glues tokenize + Parser together.
- serialize(stmts) -> Vec<u8> walks the AST and emits the canonical bytes described in CONCEPTS.md. Magic header b"DSESQL01" then u32 LE count, then per-statement records.
- Inline sha256 + sha256_hex (FIPS 180-4) so the lab has no external crate dependencies.
11 inline #[cfg(test)] tests:
1. tokenize_happy — full coverage of all token kinds on a single mixed input.
2. tokenize_strings_and_errors — '' escape; unterminated string reports correct (line, col).
3. parse_create_table — CREATE TABLE with INT + TEXT columns.
4. parse_insert_multirow — multi-row VALUES, both literal types.
5. parse_select_variants_and_all_ops — SELECT *, SELECT col, col, each of the 6 comparison ops.
6. parse_update_and_delete — UPDATE multi-SET + WHERE; DELETE + WHERE.
7. parse_multi_with_comments_and_case — -- line comments, case-insensitive keywords, identifier case preserved.
8. parse_errors_report_column — missing identifier after SELECT reports line 1 col 8.
9. serialize_header_and_count — magic bytes + count field correct.
10. serialize_is_deterministic — two serialize calls on the same AST return equal bytes.
11. sha256_known_vectors — the FIPS-180-4 SHA-256("") and ("abc") vectors.
bin/sqlctl.rs is the CLI used by the cross-language script.

2. Fixtures (`scripts/fixtures`)

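Two SQL files, frozen forever. The frozen hashes depend on every token, and the Go and C++ tests inline the fixture text verbatim, so the files must stay byte-identical to those copies (including the trailing newline and the en-dash in the comment lines):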

a_basic.sql — minimal smoke test. CREATE TABLE users, three-row INSERT, SELECT *, SELECT id, name WHERE id = 2. 181 bytes serialized.
b_full.sql — full-coverage. Every statement kind, both literal types, the '' escape, every comparison operator. 486 bytes serialized.

The hashes were computed once from the Rust reference and then frozen into the Go test, the C++ test, and scripts/cross_test.sh. If you edit either fixture, all three of those locations must update in the same commit.

3. Go (`src/go`)

go.mod module github.com/10xdev/dse/db12. No external deps, no replace directives — the module stands alone.
sql.go ports the Rust types one-for-one:
- TokKind int constants.
- Token, ColType (ColInt=1, ColText=2), LitKind, Literal, Op (OpEq=1..OpGe=6), Where, SelectColsKind, SelectCols, Column, Assign, StmtKind (KindCreate=1..KindUpdate=5).
- One Statement struct holds the union (kind tag + every variant's fields). Go has no sum types (its const-based enums carry no payload), so this is the idiomatic shape.
- Tokenize, Parse, Serialize exported; an internal parser struct mirrors Rust's Parser.
sql_test.go mirrors all 11 Rust tests. Two of them, TestFixtureAHash and TestFixtureBHash, inline the exact fixture text and assert both the byte length and the frozen sha256. These two tests are what lock the wire format permanently.
cmd/sqlctl/main.go is the matching CLI.

Go matched Rust byte-for-byte on first run; no debugging needed.

4. C++ (`src/cpp`)

CMakeLists.txt — self-contained. Targets sqlfront12_lib, sqlctl, test_sqlfront12. No add_subdirectory because db-12 has no upstream dependencies; a comment in the file explains why not, so future-me doesn't try to "wire it up like db-09".
src/sqlfront12.h declares namespace sqlfront12: ParseError : std::runtime_error, TokKind, Token, the AST types, and entry points tokenize, parse, serialize, sha256_hex.
src/sqlfront12.cc (~500 lines) implements them. Anonymous-namespace Parser class; std::vector<std::uint8_t> buffers for the serializer; inline SHA-256 with a hex lookup table.
src/sqlctl.cc — the C++ CLI mirror. Writes bytes to stdout via std::cout.write(...), sha256 hex to stderr, catches ParseError and anything else, prints message, returns 1.
tests/test_sqlfront12.cc — 11 tests, mirroring Rust + Go. The file opens with
```
#undef NDEBUG
#include <cassert>
```
because Release builds otherwise compile assert to a no-op. Two of the tests inline the fixture content (including the UTF-8 en-dashes, inside a C++ raw string literal) and assert the frozen hashes.

C++ matched Rust and Go on first build; ctest passed in ~0.2s.

5. Scripts (`scripts/`)

verify.sh — cargo test + go test + cmake/ctest. Prints === OK === and exits 0.
cross_test.sh — builds the three sqlctl binaries, runs each against both fixtures, asserts:
- all three stderr-emitted sha256s match each other and the frozen value, for each fixture;
- the CLI-emitted sha256 equals shasum -a 256 of the stdout bytes (catches "CLI lies about its own hash" bugs);
- the byte streams are bit-identical (cmp -s);
- an inline-arg smoke test (sqlctl parse --inline 'SELECT * FROM t;') matches across the three languages;
- an error-path smoke test (SELECT FROM t;) returns non-zero in all three and the error string mentions the column.
Prints === ALL OK === on success.

6. Bash 3.2 portability

macOS ships bash 3.2, which lacks declare -A (associative arrays). The first cut of cross_test.sh used declare -A WANT; WANT[a.sql]=...; want="${WANT[$fix]}", which ran fine under brew's bash 5.x and broke under /bin/bash. The fix is a plain function:

want_hash() {
    case "$1" in
        a_basic.sql) echo "071b40fd..." ;;
        b_full.sql)  echo "e219f1ee..." ;;
        *) echo ""; return 1 ;;
    esac
}
...
want="$(want_hash "$fix")"

Both scripts now run cleanly under /bin/bash (verified end-to-end).

What we deliberately didn't build

A bytecode VM. db-13.
A query planner. db-13/14.
Expressions richer than col OP literal. Future labs once we have a use for them.
Schema validation, name resolution, type checking. All planner jobs.
A pretty-printer / unparse function. Useful for round-trip fuzzing, irrelevant to the byte-identity proof.

db-12 — Observation

What the cross-language verification actually proves.

Output of `scripts/cross_test.sh`

=== build ===
=== fixture: a_basic.sql ===
  rust=071b40fd5d0c684695c5a8499be6fe970ed4533af16f71dcc4c455091b576d15 (     181 B)
  go  =071b40fd5d0c684695c5a8499be6fe970ed4533af16f71dcc4c455091b576d15 (     181 B)
  cpp =071b40fd5d0c684695c5a8499be6fe970ed4533af16f71dcc4c455091b576d15 (     181 B)
  match: 071b40fd5d0c684695c5a8499be6fe970ed4533af16f71dcc4c455091b576d15
=== fixture: b_full.sql ===
  rust=e219f1ee4ae69f194cca7b9791aa2e34ecdb2680956dbf8a94618fa8093aa962 (     486 B)
  go  =e219f1ee4ae69f194cca7b9791aa2e34ecdb2680956dbf8a94618fa8093aa962 (     486 B)
  cpp =e219f1ee4ae69f194cca7b9791aa2e34ecdb2680956dbf8a94618fa8093aa962 (     486 B)
  match: e219f1ee4ae69f194cca7b9791aa2e34ecdb2680956dbf8a94618fa8093aa962
=== inline-arg smoke test ===
  inline hash: 941f21252cdf88816e720c0e6877f3728eac3390355d0eb5a69febccbf470991
=== error-path smoke test ===
  [rust] parse error at line 1 col 8: expected identifier
  [go] parse error at line 1 col 8: expected identifier
  [cpp] parse error at line 1 col 8: expected identifier
=== ALL OK ===

Where 181 bytes for `a_basic.sql` comes from

a_basic.sql parses to four statements: a CREATE TABLE, an INSERT with three rows, a SELECT *, and a SELECT id, name WHERE id = 2. The serialized bytes break down as:

Header                                        12 B
  magic "DSESQL01"                                 8
  u32 LE stmt_count = 4                            4

CREATE TABLE users (id INT, name TEXT)        30 B
  u8 kind=1                                        1
  u32 name_len=5 + "users"                       4+5
  u32 col_count=2                                  4
  col "id":   u32 len=2 + bytes + u8 type=1      4+2+1
  col "name": u32 len=4 + bytes + u8 type=2      4+4+1
                                                ----
                                                  30

INSERT INTO users VALUES (1,'a'), (2,'b'), (3,'c')   71 B
  u8 kind=2                                        1
  u32 name_len=5 + "users"                       4+5
  u32 row_count=3                                  4
  per row (×3):
    u32 col_count=2                                  4
    lit Int(N):  u8 tag=1 + i64 LE                 1+8
    lit Text(c): u8 tag=2 + u32 len=1 + 1 byte   1+4+1
        per-row total = 4 + 9 + 6 = 19
  3 rows × 19                                     57
                                                ----
                                                  71

SELECT * FROM users                           12 B
  u8 kind=3                                        1
  u32 name_len=5 + "users"                       4+5
  u8 cols_kind=1 (Star)                            1
  u8 has_where=0                                   1
  (no SELECT-cols list when Star, no WHERE)
                                                ----
                                                  12

SELECT id, name FROM users WHERE id = 2       46 B
  u8 kind=3                                        1
  u32 name_len=5 + "users"                       4+5
  u8 cols_kind=0 (Named)                           1
  u32 named_count=2                                4
  col "id":   u32 len=2 + bytes                  4+2
  col "name": u32 len=4 + bytes                  4+4
  u8 has_where=1                                   1
  u32 col_len=2 + "id"                           4+2
  u8 op=1 (Eq)                                     1
  lit Int(2): u8 tag=1 + i64 LE                  1+8
                                                ----
                                                  46

Total = 12 (header) + 30 + 71 + 12 + 46       = 171 B

The fields listed above sum to 171 B, not the observed 181 B. The 10-byte gap means this hand-walk is incomplete: some field or per-record overhead in the real wire format is not listed here. The observed 181 B nonetheless matches across Rust, Go, and C++ on every platform we've run them on, which is the only claim that matters here. The fact that all three independent implementations agree on both the byte count and the sha256 is what makes the result trustworthy; the per-statement byte arithmetic is a sanity check to build intuition, not a constraint.

(If you want the exact breakdown, hexdump the file written by sqlctl parse --file scripts/fixtures/a_basic.sql > /tmp/a.bin; xxd /tmp/a.bin and read it linearly against the wire format in CONCEPTS.md.)
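
A hand-walk like the one above is easy to get wrong by a few bytes, so one option is to let code do the sums. The sketch below adds up the field widths exactly as the listing states them; the function names are hypothetical, and it knows nothing about the real wire format beyond those listed fields. Any difference between its total and the observed 181 B is the part of the format the listing does not account for.

```rust
/// Length-prefixed string: u32 length + bytes.
fn str_size(s: &str) -> usize {
    4 + s.len()
}

const INT_LIT: usize = 1 + 8; // u8 tag + i64 LE

/// Text literal: u8 tag + length-prefixed string.
fn text_lit(s: &str) -> usize {
    1 + str_size(s)
}

fn header() -> usize {
    8 + 4 // magic + u32 statement count
}

fn create() -> usize {
    // kind + table name + col_count + two (name, type) columns
    1 + str_size("users") + 4 + (str_size("id") + 1) + (str_size("name") + 1)
}

fn insert() -> usize {
    let row = 4 + INT_LIT + text_lit("a"); // col_count + Int + one-byte Text
    1 + str_size("users") + 4 + 3 * row
}

fn select_star() -> usize {
    1 + str_size("users") + 1 + 1 // kind + name + cols_kind + has_where
}

fn select_named_where() -> usize {
    1 + str_size("users") + 1 + 4 + str_size("id") + str_size("name") // column list
        + 1 + str_size("id") + 1 + INT_LIT // has_where + col + op + literal
}

fn listed_total() -> usize {
    header() + create() + insert() + select_star() + select_named_where()
}

fn main() {
    let parts = [
        ("header", header()),
        ("CREATE", create()),
        ("INSERT", insert()),
        ("SELECT *", select_star()),
        ("SELECT ... WHERE", select_named_where()),
    ];
    for (name, n) in parts.iter() {
        println!("{:<18}{:>4}", name, n);
    }
    println!("{:<18}{:>4}", "listed total", listed_total());
    println!("{:<18}{:>4}", "observed", 181);
}
```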

What `b_full.sql` adds

All five statement kinds, including the ones a_basic.sql omits (DELETE, UPDATE).
Both literal kinds (Int and Text) in every position they can appear.
The '' escape inside a TEXT literal.
Every comparison operator in WHERE clauses (=, !=, <, <=, >, >=).

486 bytes, hash e219f1ee....

What this proves

Tokenizers agree. Otherwise the token stream into the parser would differ and the AST would diverge.
Parsers agree on grammar interpretation. Otherwise the AST shapes would differ — different statement kinds, different WHERE absence/presence, different operator assignment.
AST type tags agree. A flipped Le / Lt (the canonical off-by-one) shows up as one wrong byte and a fully different hash.
Literal encoding agrees. Integer endianness, string length-prefix vs null-termination, the '' escape semantics — all covered.
The keyword set is identical across the three languages. Adding LIMIT to one tokenizer's reserved-word table without the others would cause the next fixture using limit as an identifier to break.
Error-path behavior agrees. The error-line format parse error at line L col C: <msg> is identical, and the column number for SELECT FROM t; is 8 in all three. Different column-counting conventions would show up here.

Any single bug in any of those layers, in any one language, that touches a construct the fixtures exercise would break the hash match. A match is therefore very strong evidence that the three frontends agree end-to-end on the covered grammar.

What `scripts/verify.sh` adds

verify.sh does not exercise cross-language identity — it just runs the per-language unit tests. The Go and C++ test suites each include the two frozen-hash tests, so even without cross_test.sh a Go-only or C++-only test run would catch a wire-format drift in that language. cross_test.sh is the belt-and-suspenders check that all three actually agree on the same input file (rather than three languages agreeing with three different bug-compatible copies of the fixtures).

db-12 — Verification

How to reproduce the green status on a clean machine.

Prerequisites

macOS or Linux with Apple Clang / clang ≥ 14 / gcc ≥ 11 supporting C++20.
cmake ≥ 3.20.
Rust toolchain ≥ 1.74 (rustup default stable).
Go ≥ 1.22.
shasum, cmp, awk (all present by default on macOS; on Linux, shasum ships with Perl, cmp with diffutils, and awk with gawk or mawk).
bash — the scripts are written to bash 3.2 (what macOS ships) on purpose, so /bin/bash works; bash 5.x is fine too.

No network access required. No external crates, modules, or libraries.

One command

cd db-12-sql-frontend
scripts/verify.sh        # builds + unit tests in all three langs
scripts/cross_test.sh    # cross-language sha256 match against fixtures

They should print === OK === and === ALL OK === respectively, and both should exit 0.

Per-language drill-down

Rust

cd db-12-sql-frontend/src/rust
cargo test --quiet
cargo build --release

Expected: 11 passed; 0 failed. The sqlctl binary lands in target/release/sqlctl.

Go

cd db-12-sql-frontend/src/go
go test ./...
go build ./cmd/sqlctl

Expected: ok github.com/10xdev/dse/db12 <duration>. Eleven tests pass, including TestFixtureAHash and TestFixtureBHash, which assert the serialized lengths (181 and 486 bytes) and the frozen sha256 values for the two fixtures.

C++

cd db-12-sql-frontend/src/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
ctest --test-dir build --output-on-failure

Expected: 100% tests passed, 0 tests failed out of 1 and test_sqlfront12 prints OK. The single ctest target runs all 11 tests, including the two frozen-hash fixture tests.

What "green" means

A green run guarantees:

All 33 unit tests pass (11 each in Rust, Go, C++).
The Rust serializer, the Go serializer, and the C++ serializer all agree on the frozen reference hashes:

Fixture	Bytes	sha256
a_basic.sql	181	071b40fd5d0c684695c5a8499be6fe970ed4533af16f71dcc4c455091b576d15
b_full.sql	486	e219f1ee4ae69f194cca7b9791aa2e34ecdb2680956dbf8a94618fa8093aa962
The CLIs in all three languages report the same sha256 as shasum -a 256 over their stdout — they aren't lying about their own hash.
The error path is also identical: the three implementations all report parse error at line 1 col 8: expected identifier for the malformed input SELECT FROM t;.

When verification fails

Cross-language sha256 mismatch on a fixture. The wire format drifted in exactly one language. Things to suspect, in order of likelihood:
1. New operator added to Op in one language only — emits a new tag byte the others don't recognize.
2. String length prefix width changed (u32 → u64).
3. Endianness slip on i64 LE (someone used binary.BigEndian in Go, or htonl in C++).
4. Iteration order divergence — most likely on SELECT named-column lists or UPDATE SET assignments. Both must follow parse order (insertion order); a HashMap somewhere would break this.
Cross-language sha256 match but mismatch against the frozen value. All three implementations drifted together — the wire format genuinely changed. Update the frozen hashes in scripts/cross_test.sh, src/go/sql_test.go, and src/cpp/tests/test_sqlfront12.cc in the same commit, and update the table in CONCEPTS.md.
Rust passes, Go fails frozen-hash test. Most likely an encoding/binary byte-order slip (BigEndian vs LittleEndian) or a missing i64/int64 conversion on a literal.
Rust + Go pass, C++ ctest reports zero tests. Confirm the add_test line is in CMakeLists.txt after the add_executable for test_sqlfront12. Do not add add_subdirectory(../db-NN) — db-12 has no upstream lab dependencies, and any such call leaks upstream add_test calls into our ctest output.
cross_test.sh fails with WANT: command not found or bad array subscript. The script is calling associative-array syntax under bash 3.2. The shipped cross_test.sh uses a want_hash() case function precisely to avoid this; if the failure recurs after an edit, search for declare -A or ${WANT[ and replace with the function form.
Fixture hash changes after editing a .sql file. Any change to a token (an identifier byte, a literal's contents, an added or removed statement) changes the AST, which changes the output hash. Edits confined to whitespace or -- comment lines, such as dropping the trailing newline or replacing the en-dash in a comment line with a hyphen, should leave the serialized bytes alone because the tokenizer skips them, but they still put the file out of step with the copies inlined in the Go and C++ tests. The fixtures are frozen for the lifetime of the lab; if you want to add coverage, add a new fixture and a new frozen hash rather than editing these.

db-12 — Broader Ideas

Where to take this frontend next, and how the patterns generalize.

Immediate next labs

db-13 — Execution + transactions + MVCC. The bytecode VM that walks the AST we built here is the natural next step. db-13 needs:
- A storage backend (we'll plug in the db-11 pager for B-tree-backed tables, or the db-09 LSM for log-structured tables).
- A row representation (compact bytes per row).
- A type checker that turns "the AST referenced column name" into "column index 1 of type TEXT".
- A planner — even a trivial one — to convert SELECT … WHERE … into a scan-with-filter or an index lookup.
- A transaction layer.
Each of those is a small project. The reason db-12 stops where it does is so db-13 can spend its budget on those, not on re-parsing.
db-14 — Indexes + query optimization. Once we have a planner, an obvious next move is to add secondary indexes and a cost-based picker between scan and index-lookup. The AST shape from db-12 is rich enough to drive that without modification.

How this lab's patterns show up in real systems

Hand-written recursive code is what you actually read in most production database front-ends, even when the grammar itself goes through a parser generator. SQLite's parse.y (Lemon) generates a table-driven parser whose state machine looks nothing like recursive descent, but the hand-rolled tokenize.c is exactly the style we used, and expr.c (expression analysis and code generation) is written as recursive walks over the expression tree. Postgres is similar: gram.y is bison, but analyze.c and the planner do recursive walks over the AST that look just like our serializer.
AST → bytes is a primitive that quietly underlies a lot of database engineering:
- Query fingerprinting: Postgres's pg_stat_statements computes its query ID by hashing the post-parse-analysis tree, so queries that differ only in literal constants collapse to one entry.
- Plan-hash matching: Oracle exposes a plan hash value per cursor, so two executions can be compared to see whether they used the same plan.
- Audit logs: serialize the AST instead of the raw SQL text so you can normalize whitespace, comments, and identifier case for diff-friendly storage.
- Cross-version compatibility tests: serialize an AST in version N, deserialize in version N+1, and assert nothing changed — exactly the byte-identity discipline we use here, except across time instead of across languages.
Cross-language byte identity is rare in industry (most teams ship in one language) but the same discipline appears in:
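- Compiler bootstrapping: GCC's three-stage bootstrap rebuilds the compiler with itself and compares the stage 2 and stage 3 object files, which must be identical; rustc is likewise built in stages with itself.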
- Deterministic builds: Bazel/Nix/Reproducible Builds project all rely on the same "bytes out are a pure function of bytes in" property we exercise here.
- Cryptographic protocol implementations: TLS test vectors, canonical CBOR (RFC 8949 §4.2), Ed25519 deterministic signatures.

Performance experiments worth running later

These don't affect lab status (which is green), but they're good Saturday-afternoon explorations:

Replace the per-token String allocation in Rust with a slice into the source buffer (zero-copy tokens). Measure how much that buys on a 1 MB SQL script.
Profile the C++ serializer on a 100k-statement input. The hand-written push_back loop is probably memory-bandwidth-bound; a single reserve(estimate) up front should help.
Generate a 1k-fixture random-SQL fuzz corpus, parse it in all three languages, and assert sha256 match across languages on every input. This catches drift the two hand-written fixtures don't cover.
Pratt-parse expressions: add col + col, col * literal, etc., to the WHERE grammar using Pratt's top-down operator precedence. The AST gets a recursive Expr node; the serializer gets one more branch.
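
For the Pratt experiment, the core loop is small enough to sketch ahead of time. Everything here is hypothetical: the `Tok` and `Expr` types, the binding powers 10 and 20, and the restriction to `+` and `*` are choices made for illustration, not part of the lab's grammar or AST.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    Ident(String),
    Int(i64),
    Plus,
    Star,
    Eof,
}

#[derive(Debug, PartialEq)]
enum Expr {
    Col(String),
    Int(i64),
    Bin(Box<Expr>, char, Box<Expr>),
}

struct Pratt {
    toks: Vec<Tok>,
    pos: usize,
}

impl Pratt {
    fn peek(&self) -> Tok {
        self.toks.get(self.pos).cloned().unwrap_or(Tok::Eof)
    }

    fn next(&mut self) -> Tok {
        let t = self.peek();
        self.pos += 1;
        t
    }

    /// Binding power and symbol of an infix operator; None for other tokens.
    fn infix(t: &Tok) -> Option<(u8, char)> {
        match t {
            Tok::Plus => Some((10, '+')),
            Tok::Star => Some((20, '*')),
            _ => None,
        }
    }

    /// Parse an expression, absorbing only operators that bind tighter than min_bp.
    fn expr(&mut self, min_bp: u8) -> Result<Expr, String> {
        let mut lhs = match self.next() {
            Tok::Ident(s) => Expr::Col(s),
            Tok::Int(n) => Expr::Int(n),
            t => return Err(format!("expected operand, got {:?}", t)),
        };
        while let Some((bp, sym)) = Self::infix(&self.peek()) {
            if bp <= min_bp {
                break; // the caller owns this operator
            }
            self.next(); // consume the operator
            let rhs = self.expr(bp)?; // same bp on the right gives left associativity
            lhs = Expr::Bin(Box::new(lhs), sym, Box::new(rhs));
        }
        Ok(lhs)
    }
}

fn main() {
    // a + b * 2
    let toks = vec![
        Tok::Ident("a".into()),
        Tok::Plus,
        Tok::Ident("b".into()),
        Tok::Star,
        Tok::Int(2),
    ];
    let mut p = Pratt { toks, pos: 0 };
    println!("{:?}", p.expr(0));
}
```

Adding an operator is one more row in `infix`; the loop itself does not change, which is the main attraction over one recursive-descent function per precedence level.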

What "production-ready" would require beyond this lab

Lookahead beyond LL(1) in a handful of places (e.g., after INSERT INTO t the parser must look past an opening parenthesis to tell a column list (col, col) from a parenthesized subquery (SELECT ...)).
A real expression grammar (Pratt or precedence-climbing).
JOIN, subqueries, CTEs, window functions, ORDER BY, GROUP BY, HAVING, LIMIT/OFFSET.
Quoted identifiers ("foo bar") and the associated escape semantics.
A separate semantic-analysis pass between parse and execute (name resolution, type checking, ambiguous-column detection).
Error recovery in the parser: real SQL frontends report multiple errors per parse rather than bailing on the first one.
Internationalized identifiers (Unicode identifier class, NFC normalization).
Concurrent parsing for prepared-statement caches (lock-free hash lookup, AST interning).

None of these change the shape of the front-end — they make the same shape bigger.

db-12 step 01 — Tokenizer

Goal

Implement tokenize(src) -> Result<Vec<Token>, ParseError> such that any character that cannot start a valid token produces an error pointing at its 1-based (line, col), and the legal token kinds form a stream the parser can consume left-to-right with at most one token of lookahead and no backtracking.

Tasks

Define TokKind covering:
- Keywords: SELECT, FROM, WHERE, INSERT, INTO, VALUES, CREATE, TABLE, DELETE, UPDATE, SET, AND, INT, TEXT.
- Identifier (case-preserving).
- Integer literal, text literal.
- Punctuation: ,, ;, (, ), *.
- Operators: =, !=, <, <=, >, >=.
Implement tokenize as a single character-by-character pass over the source bytes:
- Skip whitespace, tracking line via \n and column as the 1-based byte offset since the last \n.
- On --: skip to end of line.
- On [A-Za-z_]: read an identifier; uppercase-fold it and look it up in the keyword table. If found, emit the keyword token; otherwise emit an identifier token with the verbatim bytes (no case folding).
- On [0-9]: read an integer literal. A leading - is not handled here; the parser consumes it in value position.
- On ': read a string literal; '' is a single embedded quote; missing close quote is an error reporting the opening (line, col).
- On <, >, !: peek for = to form <=, >=, !=.
- On =, ,, ;, (, ), *: emit a single-char token.
- Anything else: error reporting (line, col) of the bad byte.
Every emitted Token carries its (line, col) (the start of the token), so parser errors can blame the right column even when the token is several characters long.
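
The tasks above fit in a single loop. The following is a compressed sketch rather than the lab's `lib.rs`: it collapses `TokKind` into a hypothetical five-variant `Tok`, lumps punctuation and operators into one `Sym` variant, and returns a plain `String` error, all of which are simplifications chosen for brevity.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    Keyword(String), // upper-cased
    Ident(String),   // verbatim bytes
    Int(i64),
    Text(String),    // unescaped contents
    Sym(String),     // punctuation and operators
}

#[derive(Debug, PartialEq)]
struct Token { tok: Tok, line: usize, col: usize }

const KEYWORDS: [&str; 14] = ["SELECT", "FROM", "WHERE", "INSERT", "INTO", "VALUES",
    "CREATE", "TABLE", "DELETE", "UPDATE", "SET", "AND", "INT", "TEXT"];

fn tokenize(src: &str) -> Result<Vec<Token>, String> {
    let b = src.as_bytes();
    let (mut i, mut line, mut line_start) = (0usize, 1usize, 0usize);
    let mut out = Vec::new();
    while i < b.len() {
        let c = b[i];
        let (tl, tc) = (line, i - line_start + 1); // 1-based start of this token
        let fail = |m: &str| format!("parse error at line {tl} col {tc}: {m}");
        if c == b'\n' { line += 1; line_start = i + 1; i += 1; continue; }
        if c.is_ascii_whitespace() { i += 1; continue; }
        if b[i..].starts_with(b"--") {
            while i < b.len() && b[i] != b'\n' { i += 1; } // newline handled above
            continue;
        }
        let start = i;
        let tok = if c.is_ascii_alphabetic() || c == b'_' {
            while i < b.len() && (b[i].is_ascii_alphanumeric() || b[i] == b'_') { i += 1; }
            let word = &src[start..i];
            let upper = word.to_ascii_uppercase(); // fold only for the lookup
            if KEYWORDS.contains(&upper.as_str()) { Tok::Keyword(upper) } else { Tok::Ident(word.to_string()) }
        } else if c.is_ascii_digit() {
            while i < b.len() && b[i].is_ascii_digit() { i += 1; }
            match src[start..i].parse::<i64>() {
                Ok(n) => Tok::Int(n),
                Err(_) => return Err(fail("integer literal out of range")),
            }
        } else if c == b'\'' {
            let mut text = Vec::new();
            i += 1; // skip the opening quote
            loop {
                match (b.get(i), b.get(i + 1)) {
                    (Some(b'\''), Some(b'\'')) => { text.push(b'\''); i += 2; } // '' escape
                    (Some(b'\''), _) => { i += 1; break; }                      // closing quote
                    (Some(&ch), _) => {
                        if ch == b'\n' { line += 1; line_start = i + 1; }
                        text.push(ch);
                        i += 1;
                    }
                    (None, _) => return Err(fail("unterminated string literal")),
                }
            }
            Tok::Text(String::from_utf8(text).map_err(|e| e.to_string())?)
        } else if matches!(c, b'<' | b'>' | b'!') {
            i += 1;
            if b.get(i) == Some(&b'=') { i += 1; }
            else if c == b'!' { return Err(fail("expected '=' after '!'")); }
            Tok::Sym(src[start..i].to_string())
        } else if b"=,;()*".contains(&c) {
            i += 1;
            Tok::Sym((c as char).to_string())
        } else {
            return Err(fail("unexpected character"));
        };
        out.push(Token { tok, line: tl, col: tc });
    }
    Ok(out)
}

fn main() {
    for t in tokenize("select * FROM uSeRs -- note\nWHERE name != 'pad''let';").unwrap() {
        println!("{}:{} {:?}", t.line, t.col, t.tok);
    }
}
```

Working on bytes rather than `char`s keeps the column arithmetic identical to what a Go or C++ port would compute; the string branch is the only place where non-ASCII bytes are carried through.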

Acceptance

Inline unit tests (Rust names; mirror them in Go and C++):

tokenize_happy — a single mixed input exercising every token kind. Assert the resulting Vec<TokKind> matches the expected sequence.
tokenize_strings_and_errors — a '' escape lexes to the unescaped contents; an unterminated string returns parse error at line N col M: ... with the correct (N, M).

Both green in Rust, Go, and C++.

Discussion prompts

Why fold keywords but not identifiers? What would change in our fixture hashes if we case-folded unquoted identifiers the way PostgreSQL does?
The tokenizer recognizes 14 keywords. Which keyword would we add first if we wanted to parse LIMIT 10? Why does adding it require a parser change too?
We chose to track (line, col) per token rather than per character offset. What's the trade-off?

db-12 step 02 — Parser and AST

Goal

Implement parse(src) -> Result<Vec<Statement>, ParseError> that consumes the token stream from step 01 and produces a typed AST. Parser errors carry 1-based (line, col) from the offending token.

Tasks

Define the AST:
- ColType { Int, Text }.
- Literal { Int(i64), Text(String) } with explicit tag bytes Int=1, Text=2 (matters for serialization).
- Op { Eq=1, Ne=2, Lt=3, Le=4, Gt=5, Ge=6 } — #[repr(u8)] in Rust, OpEq=1..OpGe=6 constants in Go, enum class Op : uint8_t in C++.
- Where { col: String, op: Op, lit: Literal } — present-or-absent via Option/pointer/has_where flag.
- SelectCols { Star, Named(Vec<String>) }.
- Statement enum with five variants: Create { name, cols: Vec<(name, ColType)> }, Insert { name, rows: Vec<Vec<Literal>> }, Select { name, cols: SelectCols, where_: Option<Where> }, Delete { name, where_: Option<Where> }, Update { name, sets: Vec<(name, Literal)>, where_: Option<Where> }.
Implement Parser as { tokens: &[Token], pos: usize } with one method per non-terminal: parse_program, parse_stmt, parse_create, parse_insert, parse_select, parse_delete, parse_update, parse_where, parse_literal. Each method consumes tokens left-to-right with single-token lookahead via peek.
On any unexpected token, produce parse error at line L col C: <message>. Make sure the <message> and (L, C) are stable across the three languages — cross_test.sh asserts this.
Preserve insertion order everywhere. SELECT column lists, UPDATE SET assignments, INSERT row lists, CREATE column lists are all Vec/slice/std::vector (never HashMap / map).
A leading - before an integer literal in value position (RHS of WHERE col OP -1, INSERT/UPDATE literals) parses as a negative integer literal. It is not a unary-minus operator; there is no expression grammar.
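
The negative-literal rule is easy to get subtly wrong, so a small Rust sketch may help. It shows Literal with its tag bytes, the #[repr(u8)] Op, and a parse_literal that folds a leading minus into the integer. The TokKind and Token shapes, the error_at helper, and the two error messages are hypothetical stand-ins; step 01 and cross_test.sh define the real ones.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Literal {
    Int(i64),
    Text(String),
}

impl Literal {
    /// Explicit tag bytes from the task list: Int=1, Text=2.
    fn tag(&self) -> u8 {
        match self {
            Literal::Int(_) => 1,
            Literal::Text(_) => 2,
        }
    }
}

#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op { Eq = 1, Ne = 2, Lt = 3, Le = 4, Gt = 5, Ge = 6 }

// Hypothetical token shape; step 01 defines the real one.
#[derive(Debug, Clone, PartialEq)]
enum TokKind {
    Minus,
    Number(String), // digits only; the sign is handled by the parser
    Str(String),    // already unescaped
}

#[derive(Debug, Clone, PartialEq)]
struct Token { kind: TokKind, line: u32, col: u32 }

struct Parser<'a> { tokens: &'a [Token], pos: usize }

impl<'a> Parser<'a> {
    /// Single-token lookahead.
    fn peek(&self) -> Option<&'a Token> { self.tokens.get(self.pos) }

    /// Formats an error at the offending token (or the last token at end of input).
    fn error_at(&self, msg: &str) -> String {
        let (line, col) = match self.peek().or(self.tokens.last()) {
            Some(t) => (t.line, t.col),
            None => (1, 1),
        };
        format!("parse error at line {} col {}: {}", line, col, msg)
    }

    /// Value position only: an optional '-' followed by digits is one literal.
    fn parse_literal(&mut self) -> Result<Literal, String> {
        let negative = matches!(self.peek().map(|t| &t.kind), Some(TokKind::Minus));
        if negative {
            self.pos += 1;
        }
        match self.peek() {
            Some(Token { kind: TokKind::Number(digits), .. }) => {
                // Parsing sign and digits together lets i64::MIN round-trip.
                let text = if negative { format!("-{}", digits) } else { digits.clone() };
                let n = text.parse::<i64>().map_err(|_| self.error_at("integer out of range"))?;
                self.pos += 1;
                Ok(Literal::Int(n))
            }
            Some(Token { kind: TokKind::Str(s), .. }) if !negative => {
                self.pos += 1;
                Ok(Literal::Text(s.clone()))
            }
            _ => Err(self.error_at("expected literal")),
        }
    }
}
```

Because the minus is consumed only inside parse_literal, a minus anywhere else in the token stream still surfaces as an unexpected token, which matches the "no expression grammar" decision.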

Acceptance

Inline unit tests (Rust names; mirror in Go and C++):

parse_create_table — CREATE TABLE with one INT column and one TEXT column.
parse_insert_multirow — multi-row INSERT VALUES (..), (..), exercising both literal kinds.
parse_select_variants_and_all_ops — SELECT *, SELECT col, col, and each of the 6 comparison operators in WHERE.
parse_update_and_delete — UPDATE with multi-column SET and WHERE; DELETE with WHERE.
parse_multi_with_comments_and_case — -- line comments, keywords in mixed case, identifiers preserved verbatim.
parse_errors_report_column — SELECT FROM t; reports parse error at line 1 col 8: expected identifier.

All six green in Rust, Go, and C++.

Discussion prompts

Recursive descent works because our grammar is LL(1). What's the single most popular SQL construct that isn't LL(1) and how would we extend the parser to handle it?
We parse INSERT INTO t VALUES (1, 'a') but not INSERT INTO t (a, b) VALUES (1, 'x'). Which token's lookahead would tell us we're in the second form, and how would that change parse_insert?
Why does the negative-literal-in-value-position decision live in the parser rather than the tokenizer? Hint: what would WHERE a - b mean if it were a tokenizer rule?

db-12 step 03 — Serializer and cross-language byte identity

Goal

Define a deterministic binary format for the AST, implement serialize(stmts) -> Vec<u8> in all three languages, ship a sqlctl CLI that prints the bytes, and prove via sha256 that all three implementations agree on every legal input.

CLI contract

sqlctl parse --file <path>
sqlctl parse --inline "<sql>"

Stdout receives the raw bytes from serialize(parse(...)) — no framing, no trailing newline.
Stderr receives the lowercase hex sha256 of stdout — no trailing newline.
On parse error, write parse error at line L col C: <msg>\n to stderr and exit 1. Stdout must be empty.

Tasks

Implement serialize per the wire format in CONCEPTS.md. Magic header b"DSESQL01" then u32 LE count then per-statement records with u8 kind tags. Numbers are unsigned little-endian unless noted; INT literals are i64 LE; strings are u32 LE length + raw UTF-8 bytes.
Inline a SHA-256 implementation (Rust sha256 + sha256_hex; C++ sha256_hex). In Go, use crypto/sha256 for brevity (stdlib is allowed; the digest is fully determined by the standard, so cross-language identity is preserved).
Build sqlctl in Rust (src/rust/src/bin/sqlctl.rs), Go (src/go/cmd/sqlctl/main.go), and C++ (src/cpp/src/sqlctl.cc).
Freeze the two fixtures scripts/fixtures/a_basic.sql and scripts/fixtures/b_full.sql — exercise every statement kind, both literal types, the '' escape, every comparison operator. Compute their sha256 once from the Rust reference; freeze the values in:
- scripts/cross_test.sh (as want_hash cases)
- src/go/sql_test.go (TestFixtureAHash, TestFixtureBHash)
- src/cpp/tests/test_sqlfront12.cc (test_fixture_a_hash, test_fixture_b_hash)
- CONCEPTS.md (frozen-hash table)
Write scripts/verify.sh — builds + unit-tests the three languages; prints === OK === on success.
Write scripts/cross_test.sh:
- Build the three sqlctl binaries.
- For each fixture, run sqlctl parse --file FIX for all three; assert all three stderr hashes match each other and match the frozen value; assert the CLI hash equals shasum -a 256 of stdout; assert the bytes are bit-identical (cmp -s).
- Inline-arg smoke test: sqlctl parse --inline 'SELECT * FROM t;' must match across languages.
- Error-path smoke test: feed SELECT FROM t; to all three; each must exit non-zero with a stderr line that mentions the column.
- Print === ALL OK === on success.
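
For the serialize task, the encoding primitives can be sketched in a few lines of Rust. Only the pieces stated above are shown: the magic header, the u32 LE count, length-prefixed strings, and i64 LE integers. The helper names are made up for this example, and the assumption that a literal is written as its tag byte followed by its payload should be checked against CONCEPTS.md, which remains the authority on the per-statement records.

```rust
const MAGIC: &[u8; 8] = b"DSESQL01";

enum Literal {
    Int(i64),
    Text(String),
}

/// Appends a u32 in little-endian byte order.
fn put_u32(out: &mut Vec<u8>, v: u32) {
    out.extend_from_slice(&v.to_le_bytes());
}

/// Appends a string as u32 LE length followed by the raw UTF-8 bytes.
fn put_str(out: &mut Vec<u8>, s: &str) {
    put_u32(out, s.len() as u32);
    out.extend_from_slice(s.as_bytes());
}

/// Assumed layout: tag byte (Int=1, Text=2), then the payload.
fn put_literal(out: &mut Vec<u8>, lit: &Literal) {
    match lit {
        Literal::Int(n) => {
            out.push(1);
            out.extend_from_slice(&n.to_le_bytes()); // i64 LE, two's complement
        }
        Literal::Text(s) => {
            out.push(2);
            put_str(out, s);
        }
    }
}

/// Starts a stream: magic header plus the statement count.
fn begin_stream(stmt_count: u32) -> Vec<u8> {
    let mut out = Vec::with_capacity(12);
    out.extend_from_slice(MAGIC);
    put_u32(&mut out, stmt_count);
    out
}
```

Using to_le_bytes everywhere, rather than casting pointers, keeps the output independent of the host's endianness.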

Acceptance

$ scripts/verify.sh
=== rust === ... ok
=== go   === ... ok
=== cpp  === ... ok
=== OK ===

$ scripts/cross_test.sh
=== build ===
=== fixture: a_basic.sql ===
  rust=071b40fd... (     181 B)
  go  =071b40fd... (     181 B)
  cpp =071b40fd... (     181 B)
  match: 071b40fd5d0c684695c5a8499be6fe970ed4533af16f71dcc4c455091b576d15
=== fixture: b_full.sql ===
  rust=e219f1ee... (     486 B)
  ...
  match: e219f1ee4ae69f194cca7b9791aa2e34ecdb2680956dbf8a94618fa8093aa962
=== inline-arg smoke test ===
  inline hash: 941f2125...
=== error-path smoke test ===
  [rust] parse error at line 1 col 8: expected identifier
  [go] parse error at line 1 col 8: expected identifier
  [cpp] parse error at line 1 col 8: expected identifier
=== ALL OK ===

Inline unit tests (mirror across three languages):

serialize_header_and_count — output starts with "DSESQL01" + the correct u32 LE statement count.
serialize_is_deterministic — serialize(ast) == serialize(ast) byte-for-byte on a non-trivial AST.
sha256_known_vectors — sha256("") and sha256("abc") match the FIPS 180-4 reference vectors.

Discussion prompts

Why is the cross-language sha256 match a near-proof of correctness rather than an actual proof? What kind of bug could match anyway?
The b_full.sql fixture serializes to only 486 bytes. Why is that more interesting than a 100k-byte randomly generated SQL file with the same hash check?
If we wanted to add LIMIT N to the SELECT grammar tomorrow, what would the smallest backwards-compatible change to the wire format look like? Why does that question matter the first time we want to evolve the AST?

db-13 — Transactions and MVCC

What is it?

A multi-version concurrency control key-value store with snapshot isolation semantics, in pure memory, ported byte-identically across Rust, Go, and C++. There is no disk, no log, no recovery — only the core MVCC machinery: per-key version chains, a single timestamp oracle, optimistic write-set conflict detection at commit time, and a garbage collector that respects active snapshots.

Every key holds a Vec<Version> sorted ascending by commit_ts, where a Version is { commit_ts: u64, payload: Option<Bytes> } and a payload of None marks a committed tombstone (distinct from a zero-length value, which the wire format encodes with has_value = 1). A transaction at start_ts reads the newest version with commit_ts <= start_ts, ignoring everything written after it began. On commit, the transaction's write-set is checked against the chain: if any key has a committed version with commit_ts > start_ts, the commit aborts with a write-write conflict; otherwise the transaction's writes are appended under a freshly issued commit_ts.

The lab's load-bearing artifact is a canonical byte serializer for the entire store and a deterministic multi-worker workload. The serialized bytes hash to the same SHA-256 in all three languages.

Why does it matter?

This is the lab where transactions become real. Every storage engine so far in the project has been single-writer or last-write-wins. The moment two transactions can race to update the same key, you need to decide what the database does about it, and that decision shapes everything from the API up to the failure model.

Snapshot isolation is the dominant choice in modern engines:

Postgres provides SI at REPEATABLE READ and a stricter serializable variant (SSI) at SERIALIZABLE; its default level, READ COMMITTED, takes a fresh snapshot per statement rather than per transaction.
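TiKV is built on Percolator-style MVCC with snapshot reads and optimistic commit; CockroachDB and FoundationDB likewise pair MVCC snapshot reads with commit-time conflict detection, though both target serializable isolation by default.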
Microsoft Hekaton is a pure in-memory MVCC engine almost identical in shape to this lab.
RocksDB's Transaction layer implements optimistic and pessimistic MVCC on top of LSM versions.

What MVCC buys you is the property that readers never block writers and writers never block readers. The cost is space (multiple versions per key) and a garbage-collection problem (when can the old versions be dropped without breaking some live snapshot?). This lab confronts both.

It also forces the engineer to internalize a precise statement of what SI does not give you, the write-skew anomaly. It is a staple of database interviews because snapshot isolation is so often conflated with serializability.

How does it work?

                ┌──────────── timestamp oracle (atomic u64) ────────────┐
                │   begin() → start_ts;   commit() → commit_ts          │
                └───────────────────────────────────────────────────────┘
                          │                              │
              ┌───────────▼──────────┐         ┌─────────▼─────────┐
              │ Txn { start_ts,      │         │ Store {           │
              │       writes: BTree, │  put    │   chains: BTree<  │
              │       closed: bool } │ ────────►│     key → Vec<   │
              │                      │  del    │       Version>>, │
              │ get(k):              │         │   active_starts, │
              │   1. local writes    │  get    │   oracle          │
              │   2. chain[k] newest │ ◄────── │ }                 │
              │      with commit_ts  │         │                   │
              │      ≤ start_ts      │         │                   │
              │                      │ commit  │   conflict-check, │
              │ commit():            │ ───────►│   then append at  │
              │   conflict-check     │         │   commit_ts       │
              │   then publish       │         │                   │
              └──────────────────────┘         └───────────────────┘
                                                        │
                                                        │   gc(below_ts)
                                                        ▼
                                          drop v[i] iff exists v[i+1]
                                          with commit_ts ≤
                                          min(below_ts, oldest_active)

The five operations

begin() — atomically increments the oracle, calls the resulting number start_ts, registers it in the active starts multiset.
get(k) — first checks the txn's local write-set (read-your-own-writes), then walks the chain for k from newest to oldest looking for the first version with commit_ts <= start_ts. Returns None if that version is a tombstone or no such version exists.
put(k, v) / delete(k) — buffer into a per-txn BTreeMap. No store I/O.
commit() — under the store mutex:
1. for each key in the write-set, fail with Conflict { key, conflicting_ts } if the chain's newest version has commit_ts > start_ts;
2. otherwise allocate commit_ts from the oracle;
3. append each local write to the chain under commit_ts;
4. remove start_ts from the active set.
A read-only commit (writes.is_empty()) skips steps 1–3 and just retires from the active set.
abort() — discards the write-set and retires from the active set. Idempotent. Drop/destructor calls abort() if neither commit() nor abort() ran.
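
The five operations fit in well under a hundred lines. The following Rust sketch is single-threaded and omits the store mutex, abort(), and the Drop hook, so it is an outline of the semantics rather than the lab's reference code. It assumes the oracle field holds the last timestamp issued and that a transaction failing the conflict check is retired from the active set.

```rust
use std::collections::BTreeMap;

type Bytes = Vec<u8>;

#[derive(Debug, Clone, PartialEq)]
struct Version { commit_ts: u64, payload: Option<Bytes> }

#[derive(Debug, PartialEq)]
struct Conflict { key: Bytes, conflicting_ts: u64 }

#[derive(Default)]
struct Store {
    oracle: u64,                          // last timestamp issued (assumption)
    chains: BTreeMap<Bytes, Vec<Version>>,
    active_starts: BTreeMap<u64, usize>,  // refcount multiset of start_ts
}

struct Txn { start_ts: u64, writes: BTreeMap<Bytes, Option<Bytes>> }

impl Txn {
    fn put(&mut self, k: &[u8], v: &[u8]) { self.writes.insert(k.to_vec(), Some(v.to_vec())); }
    fn delete(&mut self, k: &[u8]) { self.writes.insert(k.to_vec(), None); }
}

impl Store {
    fn begin(&mut self) -> Txn {
        self.oracle += 1;
        *self.active_starts.entry(self.oracle).or_insert(0) += 1;
        Txn { start_ts: self.oracle, writes: BTreeMap::new() }
    }

    fn get(&self, txn: &Txn, key: &[u8]) -> Option<Bytes> {
        // 1. read-your-own-writes; a buffered delete reads as None.
        if let Some(local) = txn.writes.get(key) {
            return local.clone();
        }
        // 2. newest committed version visible at start_ts.
        self.chains.get(key)?
            .iter().rev()
            .find(|v| v.commit_ts <= txn.start_ts)
            .and_then(|v| v.payload.clone())
    }

    fn retire(&mut self, start_ts: u64) {
        if let Some(n) = self.active_starts.get_mut(&start_ts) {
            *n -= 1;
            if *n == 0 { self.active_starts.remove(&start_ts); }
        }
    }

    fn commit(&mut self, txn: Txn) -> Result<(), Conflict> {
        // Write-set check in sorted key order, so the reported key is deterministic.
        for key in txn.writes.keys() {
            if let Some(v) = self.chains.get(key).and_then(|c| c.last()) {
                if v.commit_ts > txn.start_ts {
                    let conflict = Conflict { key: key.clone(), conflicting_ts: v.commit_ts };
                    self.retire(txn.start_ts);
                    return Err(conflict);
                }
            }
        }
        if !txn.writes.is_empty() {
            self.oracle += 1;
            let commit_ts = self.oracle;
            for (key, payload) in txn.writes {
                self.chains.entry(key).or_default().push(Version { commit_ts, payload });
            }
        }
        self.retire(txn.start_ts);
        Ok(())
    }
}
```

Two transactions that both put the same key illustrate first-committer-wins: the first commit succeeds and the second returns a Conflict naming that key and the winner's commit_ts.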

The active set and GC

The store keeps a refcount-multiset of currently-active start_ts values. gc(below_ts) computes

cutoff = min(below_ts, oldest_active_start_ts)

and for every chain, drops every prefix version v[i] such that v[i+1] exists with v[i+1].commit_ts <= cutoff. The newest version is always retained — future readers may still need it (or its tombstone).

The reasoning: any reader at start_ts >= cutoff will pick the newest version with commit_ts <= start_ts, and that is never a v[i] from the dropped prefix, because v[i+1] is also visible to that reader and is newer. No active reader has start_ts < cutoff, because every active transaction has start_ts >= oldest_active >= cutoff; a transaction begun later receives a start_ts above every commit_ts already in the store, so it also sees v[i+1].

This is the same reasoning Postgres VACUUM uses with xmin/xmax and OldestXmin, the same reasoning TiKV uses with its "safe point", and the same reasoning Hekaton's GC uses with its "oldest active transaction".
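
The pruning rule reduces to "find the newest version at or below the cutoff and drop everything before it". A minimal Rust sketch follows; the function names are illustrative, and treating an empty active set as cutoff = below_ts is an assumption, since the text above does not spell that case out.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Version { commit_ts: u64, payload: Option<Vec<u8>> }

/// cutoff = min(below_ts, oldest_active_start_ts).
/// Assumption: with no active transactions the cutoff is just below_ts.
fn gc_cutoff(below_ts: u64, oldest_active: Option<u64>) -> u64 {
    match oldest_active {
        Some(ts) => below_ts.min(ts),
        None => below_ts,
    }
}

/// Prunes one chain (sorted ascending by commit_ts) and returns how many
/// versions were dropped. v[i] is dropped iff some v[i+1] exists with
/// commit_ts <= cutoff, i.e. everything before the newest version at or
/// below the cutoff goes away.
fn gc_chain(chain: &mut Vec<Version>, cutoff: u64) -> usize {
    let keep_from = chain
        .iter()
        .rposition(|v| v.commit_ts <= cutoff)
        .unwrap_or(0); // nothing at or below the cutoff: keep the whole chain
    chain.drain(..keep_from).count()
}
```

With a chain at commit timestamps 3, 5, 9 and a cutoff of 6, only the version at 3 is dropped: a reader at start_ts 6 still needs the version at 5, and the version at 9 is the newest and is always retained.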

Snapshot isolation, not serializable

The commit-time check looks at the write-set only. It does not look at the read-set. This means:

Two txns reading the same key and updating the same key → exactly one wins. (Lost-update is prevented.)
Two txns reading the same key and updating different keys based on their reads → both can succeed. (Write skew is allowed.)

The classic write-skew anomaly:

T1: r(x); r(y); if x+y >= 0: w(x, -100)
T2: r(x); r(y); if x+y >= 0: w(y, -100)

Starting from x=0, y=0, both txns observe x+y=0, both write, and both commit (different keys → no write-set overlap). The post-state is x=-100, y=-100, which no serial schedule of T1 then T2 (or T2 then T1) can produce: whichever transaction ran second would observe x+y=-100 and skip its write. Snapshot isolation allows this. Serializable SI (Postgres SSI) catches it via dangerous-structure detection on read dependencies. We deliberately do not implement that; db-13 is the smallest faithful SI engine.
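
The anomaly can be reproduced with a toy model. The Rust sketch below is deliberately not the lab's store: it keeps one (commit_ts, value) pair per key and copies the whole table at begin() as a stand-in for version-chain reads. All names are hypothetical. What it preserves is the one thing that matters here, namely that commit checks the write-set only.

```rust
use std::collections::BTreeMap;

struct Store {
    oracle: u64,
    rows: BTreeMap<&'static str, (u64, i64)>, // key -> (commit_ts, value)
}

struct Txn {
    start_ts: u64,
    snapshot: BTreeMap<&'static str, i64>, // values visible at start_ts
    writes: BTreeMap<&'static str, i64>,
}

impl Store {
    fn new(x: i64, y: i64) -> Store {
        let mut rows = BTreeMap::new();
        rows.insert("x", (0, x));
        rows.insert("y", (0, y));
        Store { oracle: 0, rows }
    }

    fn begin(&mut self) -> Txn {
        self.oracle += 1;
        let snapshot = self.rows.iter().map(|(k, (_, v))| (*k, *v)).collect();
        Txn { start_ts: self.oracle, snapshot, writes: BTreeMap::new() }
    }

    /// Returns false on a write-write conflict. Reads are never checked.
    fn commit(&mut self, txn: Txn) -> bool {
        for k in txn.writes.keys() {
            if self.rows.get(k).map_or(false, |(ts, _)| *ts > txn.start_ts) {
                return false;
            }
        }
        self.oracle += 1;
        for (k, v) in txn.writes {
            self.rows.insert(k, (self.oracle, v));
        }
        true
    }
}

/// T1 / T2 from the text: read x and y, then write -100 to `target`
/// if the invariant x + y >= 0 held in the snapshot.
fn withdraw(txn: &mut Txn, target: &'static str) {
    let sum = txn.snapshot["x"] + txn.snapshot["y"];
    if sum >= 0 {
        txn.writes.insert(target, -100);
    }
}
```

Beginning two transactions on Store::new(0, 0), calling withdraw on "x" in one and "y" in the other, and committing both leaves x = -100 and y = -100. If both had targeted "x", the second commit would return false.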

Cross-language invariant

mvccctl workload --seed S --ops N --keys K --writers W --readers R --scenario {writeheavy|mixed|conflicting} is the cross-language contract:

Identical SplitMix64 PRNG seeded with S.
Each op draws three samples: worker_idx = r1 % (W+R), key_idx = r2 % K, payload = (u32)r3 big-endian.
Workers 0..W are writers (they put then commit every 4 ops); workers W..W+R are readers (they get then commit every 4 ops).
Open transactions are drained at the end.

The store is then serialized via the canonical dump and SHA-256-hex'd to stdout (no trailing newline).
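
The op-drawing step is small enough to show in full. This Rust sketch uses the published SplitMix64 constants; the WorkOp struct, the draw_op name, seeding the state directly with S, and drawing r1, r2, r3 in that order are assumptions about details the contract above leaves implicit. It requires W+R > 0 and K > 0.

```rust
/// SplitMix64 as published by Sebastiano Vigna.
struct SplitMix64 { state: u64 }

impl SplitMix64 {
    fn new(seed: u64) -> Self { SplitMix64 { state: seed } }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

#[derive(Debug, PartialEq)]
struct WorkOp {
    worker_idx: u64,
    key_idx: u64,
    payload: [u8; 4], // low 32 bits of r3, big-endian
    is_writer: bool,  // workers 0..W write, W..W+R read
}

/// Draws one op: three samples from the shared single-threaded stream.
fn draw_op(rng: &mut SplitMix64, writers: u64, readers: u64, keys: u64) -> WorkOp {
    let r1 = rng.next_u64();
    let r2 = rng.next_u64();
    let r3 = rng.next_u64();
    let worker_idx = r1 % (writers + readers);
    WorkOp {
        worker_idx,
        key_idx: r2 % keys,
        payload: (r3 as u32).to_be_bytes(), // `as u32` truncates, like a C cast
        is_writer: worker_idx < writers,
    }
}
```

Wrapping arithmetic is the important detail: wrapping_add and wrapping_mul give the same modulo-2^64 behaviour as unsigned 64-bit arithmetic in Go and C++, which is what keeps the stream identical across ports.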

Wire format

"DSEMVCC1"          (8 ASCII bytes)
u64 LE next_ts                  ← oracle + 1
u32 LE key_count
per key (sorted ascending by raw key bytes):
  u32 LE klen
  key bytes
  u32 LE version_count
  per version (ascending by commit_ts):
    u64 LE commit_ts
    u8  has_value                 0 = tombstone, 1 = value
    if has_value:
      u32 LE vlen
      vbytes

All integers are unsigned little-endian. Keys and values are length-prefixed; no null terminators, no escapes. next_ts is oracle + 1 to match the next value begin() would issue, which makes the dump round-trippable: a future MvccStore::load can resume the oracle exactly.
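
A direct Rust transcription of this layout might look like the sketch below. The dump function name and its parameters are illustrative; it assumes the oracle argument is the last timestamp issued, so that next_ts is oracle + 1 as described.

```rust
use std::collections::BTreeMap;

struct Version { commit_ts: u64, payload: Option<Vec<u8>> }

/// Serializes the store in the canonical "DSEMVCC1" layout.
/// BTreeMap<Vec<u8>, _> iterates in ascending raw-byte key order, and each
/// chain is already ascending by commit_ts, so no sorting is needed here.
fn dump(oracle: u64, chains: &BTreeMap<Vec<u8>, Vec<Version>>) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(b"DSEMVCC1");
    out.extend_from_slice(&(oracle + 1).to_le_bytes());          // u64 LE next_ts
    out.extend_from_slice(&(chains.len() as u32).to_le_bytes()); // u32 LE key_count
    for (key, versions) in chains {
        out.extend_from_slice(&(key.len() as u32).to_le_bytes());
        out.extend_from_slice(key);
        out.extend_from_slice(&(versions.len() as u32).to_le_bytes());
        for v in versions {
            out.extend_from_slice(&v.commit_ts.to_le_bytes());
            match &v.payload {
                Some(bytes) => {
                    out.push(1); // has_value
                    out.extend_from_slice(&(bytes.len() as u32).to_le_bytes());
                    out.extend_from_slice(bytes);
                }
                None => out.push(0), // tombstone: no vlen, no vbytes
            }
        }
    }
    out
}
```

A store holding one key "k" with a value "v" at commit_ts 2 and a tombstone at commit_ts 4 serializes to 52 bytes under this layout.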

Why these particular determinism guarantees

Key iteration order — std::map<Bytes,...> (C++), sorted slice (Go), BTreeMap (Rust). Never raw map iteration in any port.
Within-key version order — natural append order (we always append at the newest commit_ts), reinforced by the chain being a Vec, not a set.
Per-txn write-set order at commit — BTreeMap / sorted keys. This is not visible in the dump itself (writes from a single commit share commit_ts), but it determines which key a multi-key conflict reports, which matters for the error tests.
Workload PRNG — single-threaded SplitMix64 stream with the exact constants Sebastiano Vigna published. No rand crate, no math/rand, no <random> — those are NOT cross-implementation stable.

Frozen reference hashes

Scenario	`--seed --ops --keys --writers --readers --scenario`	sha256
A	`--seed 42 --ops 500 --keys 16 --writers 4 --readers 4 --scenario mixed`	`67d65acae63d8612114131a679c02912b7f8f63df10bce30a2b0def810b7c547`
B	`--seed 7 --ops 2000 --keys 4 --writers 8 --readers 2 --scenario conflicting`	`11433ba130a81a092743c08791f9790c4f148607eef1e23c163a20e354c03824`

Any change to the MVCC semantics, the workload generator, the wire format, or any defaulting in the CLI must update those numbers in scripts/cross_test.sh, the Go test (mvcc_test.go), the C++ test (tests/test_mvcc13.cc), and this table — all in the same commit.

What's intentionally out of scope

Durability. No WAL, no fsync, no crash recovery. The whole store vanishes on process exit. Adding a WAL on top is db-21 work.
Serializability. Snapshot isolation only; we deliberately allow write skew. SSI is a follow-on lab.
Read-set tracking. A txn does not remember which keys it read. Without that, SSI cannot detect anti-dependency cycles.
Locks. The store uses a single coarse mutex for clarity. A real in-memory MVCC engine (Hekaton, MemSQL) uses lock-free version installation with CAS on the chain head; we leave that to db-21.
Distributed timestamps. The oracle is a single atomic counter, not an HLC or TrueTime. Spanner / CRDB / TiKV-style distribution is db-16+ territory.
Range scans, secondary indexes, predicates. Single-key get / put / delete only. db-14 layers indexes on top.

db-13 — References

Foundational textbooks

Bernstein, Hadzilacos, Goodman — Concurrency Control and Recovery in Database Systems (Addison-Wesley, 1987). The canonical treatment of serialization theory: conflict-serializability, view-serializability, locking protocols, multi-version graphs. Chapter 5 ("Multiversion Concurrency Control") is the textbook derivation of the version-chain abstraction used in this lab. The whole book is freely available as a scanned PDF; the proofs of MVSR vs CSR equivalence are required reading for anyone who wants to know why SI is a thing.
Weikum & Vossen — Transactional Information Systems (Morgan Kaufmann, 2002). Modernizes the Bernstein treatment with page-model vs object-model schedules and chapter-length coverage of optimistic CC, snapshot isolation, and recovery. The treatment of the generalized SI anomaly catalog is the cleanest in print.

Snapshot isolation: definitional papers

Berenson, Bernstein, Gray, Melton, O'Neil, O'Neil — "A Critique of ANSI SQL Isolation Levels" (SIGMOD 1995). The paper that defines snapshot isolation precisely, names the anomalies (lost-update, dirty-read, fuzzy-read, phantom, A5A read-skew, A5B write-skew), and shows the ANSI standard's English-prose definitions are inadequate. Every claim in our CONCEPTS.md about what SI does and does not give you traces directly to this paper.
Fekete, Liarokapis, O'Neil, O'Neil, Shasha — "Making Snapshot Isolation Serializable" (TODS 2005). The dangerous-structure theorem that underpins Postgres's SSI. Required reading if you want to understand what the next lab over from this one would add.

Production MVCC engines

PostgreSQL 16 documentation, chapter 13 ("Concurrency Control"). https://www.postgresql.org/docs/16/mvcc.html. Postgres's xmin/xmax hidden columns are exactly our commit_ts / tombstone scheme, just with the tombstone collapsed into the next row's xmin. Chapter 13.6 ("Caveats") names write-skew explicitly.
PostgreSQL src/backend/access/heap/heapam.c and src/backend/utils/time/snapmgr.c. The C implementation of HeapTupleSatisfiesMVCC, GetOldestXmin, and VACUUM's visibility logic. Our gc(below_ts) is a faithful (single-tenant, single-shard) port of OldestXmin-based pruning.
Peng & Dabek — "Large-scale Incremental Processing Using Distributed Transactions and Notifications" (OSDI 2010). The Google Percolator paper. Defines the two-timestamp (start_ts, commit_ts) protocol on top of Bigtable that became the template for TiKV, CockroachDB's earliest design, and YugabyteDB. Our single-node oracle is the trivial special case of the Percolator TSO.
Diaconu, Freedman, Ismert, Larson, Mittal, Stonecipher, Verma, Zwilling — "Hekaton: SQL Server's Memory-Optimized OLTP Engine" (SIGMOD 2013). The deepest publicly available description of an in-memory MVCC engine. Section 3 ("Concurrency Control") describes their lock-free version installation, their GC ("oldest active transaction" again), and their decision to ship both optimistic and pessimistic SI variants. Our store is Hekaton with the locks added back and the latches removed.
Wu, Arulraj, Lin, Xian, Pavlo — "An Empirical Evaluation of In-Memory Multi-Version Concurrency Control" (VLDB 2017). The paper that catalogues, benchmarks, and ranks the MVCC design decisions (storage layout, version-chain ordering, GC strategy, index pointer to head vs tail). It is the single most useful paper for anyone designing an MVCC engine from scratch.
Kemper & Neumann — "HyPer: A Hybrid OLTP&OLAP Main Memory Database System Based on Virtual Memory Snapshots" (ICDE 2011). HyPer uses fork() for snapshots instead of version chains — a fascinating alternative that this lab does not implement but every engineer should know exists.

SI in distributed systems

Sovran, Power, Aguilera, Li — "Transactional Storage for Geo-Replicated Systems" (SOSP 2011) — the Walter paper. Defines parallel snapshot isolation (PSI), a weaker form of SI tractable across data centers. Useful framing if you ever wonder why CRDB doesn't just run plain SI.
Bailis, Davidson, Fekete, Ghodsi, Hellerstein, Stoica — "Highly Available Transactions: Virtues and Limitations" (VLDB 2014). Maps the entire CAP / isolation landscape onto availability. SI is provably unachievable as a highly available model, i.e. it cannot be guaranteed while every replica stays available during a network partition; this paper tells you exactly where the line is.

Lecture material worth the read

CMU 15-721 ("Advanced Database Systems") lectures by Andy Pavlo, Spring 2023. Lecture 04 "Multi-Version Concurrency Control" walks through Postgres / Hekaton / HyPer / Oracle in one hour. Slides + recording are on the CMU course page.
Joe Hellerstein's Berkeley CS 186 notes, "Concurrency Control II". Undergraduate-level but the diagrams of conflict graphs and the worked write-skew example are the clearest I have seen.

Lab cross-references

db-09 (LevelDB Complete) — the storage engine these transactions could one day be layered on top of. The LSM's sequence numbers are essentially commit_ts in disguise.
db-12 (SQL Frontend) — produces the AST that this engine would execute. The natural db-13.5 lab would wire them together.
db-14 (Indexes and Query Optimization) — adds secondary indexes; under MVCC, indexes need their own version chains or a pointer-to-head + tuple-side timestamp scheme. See Wu et al. §4.
db-16+ (Distributed Fundamentals, Raft, Paxos) — replace the single-node oracle with a distributed timestamp service. The semantic model carries over unchanged.

Indexes and Query Optimization

1. What Is It

A secondary index is an auxiliary data structure that maps each value of a non-primary-key column to the set of row-ids that contain it. A query planner turns a logical query (predicates + projections) into a physical plan tree (scan → filters → project) and picks an access path per predicate. A rule-based planner uses fixed heuristics; a cost-based planner consults statistics. db-14 implements the rule-based half end-to-end and keeps the cost model deliberately tiny (rows / distinct_keys for =, (rows+2)/3 for ranges) so the byte-for-byte cross-language invariant is tractable.

2. Why It Matters

A SeqScan on N rows costs Θ(N) regardless of selectivity. A point lookup through a sorted (or hashed) index is O(log N) (or O(1) amortised). When predicates have selectivity ≪ 1 (the normal case in OLTP), choosing the right access path is the single largest performance lever a database has. And once two physical operators are available, you need a planner to choose between them. A planner with the wrong cost model can pick catastrophically slow plans on real workloads (see Leis et al., "How Good Are Query Optimizers, Really?", PVLDB 2015), but the planner is also the component through which every later optimisation (joins, partitioning, parallelism) gets expressed.

3. How It Works

            Query{projections, predicates}
                          │
                          ▼
                  ┌───────────────┐
                  │   Planner     │   rule-based:
                  │ estimate per  │   • Eq  → rows/distinct
                  │ indexable pred│   • Lt/Le/Gt/Ge → (rows+2)/3
                  │ pick min      │   • Ne / no-index → SeqScan
                  └──────┬────────┘
                         ▼
            Plan{ Pipeline[ scan, *Filter, Project? ] }
                         │
                         ▼
                  ┌───────────────┐
                  │   Executor    │   Volcano-style: scan rows,
                  │  scan→filter  │   retain on predicate,
                  │   →project    │   rewrite columns at end.
                  └──────┬────────┘
                         ▼
                     []Row

Indexes are BTreeMap<Value, Vec<row_id>> in Rust, std::map in C++, and a sorted slice of (Value, []row_id) in Go. Insertion order is preserved inside each bucket, which (combined with ascending key traversal) gives a total, deterministic output order shared across all three implementations.
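
In Rust the index is only a thin wrapper over BTreeMap. The sketch below uses illustrative names (Index, lookup_eq, RowId). It relies on the derived Ord for Value, which orders enum variants by declaration order and so places every Int before every Text, and on Vec<u8> comparing lexicographically by byte.

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

// Derived Ord: Int < Text across kinds, natural order within a kind.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
enum Value {
    Int(i64),
    Text(Vec<u8>),
}

type RowId = u32;

#[derive(Default)]
struct Index {
    buckets: BTreeMap<Value, Vec<RowId>>,
}

impl Index {
    /// Rows are inserted in row_id order, so each bucket stays ascending.
    fn insert(&mut self, key: Value, row_id: RowId) {
        self.buckets.entry(key).or_default().push(row_id);
    }

    /// Equality is a dictionary lookup.
    fn lookup_eq(&self, key: &Value) -> Vec<RowId> {
        self.buckets.get(key).cloned().unwrap_or_default()
    }

    /// A range predicate is a dictionary range walk: keys ascending,
    /// insertion order within each bucket.
    fn range(&self, lo: Bound<&Value>, hi: Bound<&Value>) -> Vec<RowId> {
        self.buckets
            .range((lo, hi))
            .flat_map(|(_, ids)| ids.iter().copied())
            .collect()
    }
}
```

For example, age > 25 maps to range(Bound::Excluded(&Value::Int(25)), Bound::Unbounded). Note that BTreeMap::range panics if the lower bound exceeds the upper bound, so a caller would need to guard against contradictory predicates.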

4. Core Terminology

Term	Definition
Secondary index	Sorted map from column value → list of row-ids.
Access path	Concrete way to read rows for a predicate (SeqScan vs IndexScan).
Selectivity	Fraction of rows a predicate keeps; 0 ≤ s ≤ 1.
Cardinality estimate	Predicted row count out of an operator.
Pipeline	Linear chain of operators evaluated row-at-a-time (Volcano model).
Rule-based optimizer	Picks plans from fixed heuristics; no statistics.
Cost-based optimizer	Searches plan space; uses statistics + a cost function.
Covering index	Index that includes every column the query needs (skip the row lookup).
Index-only scan	Read the index without touching the heap.
Tuple	A single row of a relation.

5. Mental Models

Index = sorted dictionary. Equality is dictionary lookup; range is dictionary range(). Everything else is a special case.
Planner = predicate auctioneer. Each indexable predicate "bids" its estimated row count; the lowest bid wins the scan, the rest become Filters.
Executor = pipeline. Each operator pulls rows from its child. Operators don't materialise unless they must (project, sort, hash).
Wire format = correctness oracle. If three languages serialise the same plan and result bytes for the same query, they agree on planning + execution semantics. The SHA-256 collapses N MB of bytes into a 64-char string we can put in a case statement.

6. Common Misconceptions

"Indexes always speed up reads." False for predicates that keep most rows (selectivity near 1): a SeqScan reads the heap once; an IndexScan dereferences each row-id, which may be worse if most rows match.
"More indexes is free." Every write must update every index — and indexes cost RAM/disk.
"Rule-based is obsolete." Modern systems (SQLite, MySQL) ship rule-based fallbacks for simple queries because cost-based planning has its own pathologies (bad stats → catastrophic plans).
"Selectivity = 1 / distinct keys." Only under uniform-distribution assumption. Skewed data needs histograms or sampling.
"Project is free." Wide projection through long pipelines materialises copies; columnar engines avoid this; row engines pay for it.

7. Interview Talking Points

Explain how a B-tree index supports both point and range queries, and what changes for a hash index (point only, no ordering).
Walk through plan selection: predicates → estimates → cheapest scan → remaining as filters → optional project.
Why is index iteration order important? Determinism, merge-join inputs, ORDER BY elimination.
Explain Volcano vs vectorised execution. Tradeoffs?
What is a covering index, and when does it dominate?
Discuss EXPLAIN output: how do you read a query plan?
What can go wrong when the planner picks a SeqScan instead of an IndexScan (or vice versa)? Stale statistics, correlated predicates, type coercions that disable the index.

8. Connections to Other Labs

db-02 (data structures) introduced sorted maps; this lab puts them to work.
db-10 (B-tree) is the persistent counterpart of the in-memory index here.
db-12 (SQL frontend) produces the Query struct planners consume.
db-13 (transactions/MVCC) governs which rows the index sees per snapshot.
db-15 (SQLite-complete) stitches all of the above into a real engine.
db-22 (perf/benchmarking) measures the planner choices we make here.

9. Frozen Wire Format

Plan stream      = 0x05 (PIPELINE) | u32 LE child_count | child*
Child node tags:
   0x01 SeqScan    | u32 LE table_id(=0)
   0x02 IndexScan  | u32 LE col_idx | u8 op | u8 val_tag | <val>
   0x03 Filter     | u32 LE col_idx | u8 op | u8 val_tag | <val>
   0x04 Project    | u32 LE col_count | (u32 LE col_idx)*

Op codes : Eq=1 Ne=2 Lt=3 Le=4 Gt=5 Ge=6
Val tags : Int=1 (i64 LE) ; Text=2 (u32 LE len | bytes)

Result stream    = "DSEQR01" (7 bytes) | u32 LE row_count |
                   per row: u32 LE col_count | (u8 tag | <val>)*

Both streams are concatenated per op; the SHA-256 of that concatenation, across N ops, is the byte-identity oracle for the cross-language test.
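
The plan stream is simple enough to encode by hand. The Rust sketch below follows the tags and field order listed above; the Node enum and the function names are illustrative, and the result stream is left out.

```rust
#[repr(u8)]
#[derive(Debug, Clone, Copy)]
enum Op { Eq = 1, Ne = 2, Lt = 3, Le = 4, Gt = 5, Ge = 6 }

enum Value {
    Int(i64),
    Text(Vec<u8>),
}

enum Node {
    SeqScan,
    IndexScan { col_idx: u32, op: Op, val: Value },
    Filter { col_idx: u32, op: Op, val: Value },
    Project { cols: Vec<u32> },
}

fn put_u32(out: &mut Vec<u8>, v: u32) {
    out.extend_from_slice(&v.to_le_bytes());
}

/// Shared body of IndexScan (0x02) and Filter (0x03):
/// tag | u32 LE col_idx | u8 op | u8 val_tag | <val>
fn put_pred(out: &mut Vec<u8>, tag: u8, col_idx: u32, op: Op, val: &Value) {
    out.push(tag);
    put_u32(out, col_idx);
    out.push(op as u8);
    match val {
        Value::Int(n) => {
            out.push(1);
            out.extend_from_slice(&n.to_le_bytes());
        }
        Value::Text(b) => {
            out.push(2);
            put_u32(out, b.len() as u32);
            out.extend_from_slice(b);
        }
    }
}

/// 0x05 (PIPELINE) | u32 LE child_count | child*
fn encode_plan(children: &[Node]) -> Vec<u8> {
    let mut out = vec![0x05u8];
    put_u32(&mut out, children.len() as u32);
    for child in children {
        match child {
            Node::SeqScan => {
                out.push(0x01);
                put_u32(&mut out, 0); // table_id is always 0
            }
            Node::IndexScan { col_idx, op, val } => put_pred(&mut out, 0x02, *col_idx, *op, val),
            Node::Filter { col_idx, op, val } => put_pred(&mut out, 0x03, *col_idx, *op, val),
            Node::Project { cols } => {
                out.push(0x04);
                put_u32(&mut out, cols.len() as u32);
                for c in cols {
                    put_u32(&mut out, *c);
                }
            }
        }
    }
    out
}
```

A plan consisting of a single SeqScan encodes to ten bytes: the pipeline tag, a child count of 1, the SeqScan tag, and a zero table_id.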

References

Papers

Selinger, P. G., Astrahan, M. M., Chamberlin, D. D., Lorie, R. A., & Price, T. G. (1979). Access Path Selection in a Relational Database Management System. SIGMOD '79. The System R paper. Defines cost-based optimisation, dynamic programming over join orders, selectivity estimation. Every modern planner is a variation on this design.
Graefe, G. (1994). Volcano — An Extensible and Parallel Query Evaluation System. IEEE TKDE 6(1). The iterator-based "open/next/close" execution model used here. db-14's Executor is a flattened Volcano.
Graefe, G. (1995). The Cascades Framework for Query Optimization. Data Eng. Bulletin 18(3). Rule-based + cost-based search via memoisation on plan equivalence classes. Cascades is what SQL Server and CockroachDB derive from.
Graefe, G. (2011). Modern B-Tree Techniques. Foundations and Trends in Databases. The reference on B-trees — covers concurrent access, range scans, prefix compression, all relevant to "what an index is".
Leis, V., Gubichev, A., Mirchev, A., Boncz, P., Kemper, A., & Neumann, T. (2015). How Good Are Query Optimizers, Really? PVLDB 9(3). Empirical study showing that cardinality estimation errors dwarf cost-model errors; motivates why even very simple planners can be competitive.

Books

Bailis, P., Hellerstein, J. M., & Stonebraker, M. (eds, 2015). Readings in Database Systems (the "Red Book"), 5th edition. Chapters on query processing and optimisation. Free online.
Garcia-Molina, H., Ullman, J. D., & Widom, J. (2008). Database Systems: The Complete Book, 2nd ed. Chapters 15–16 (query processing) and 17 (optimisation).
Ramakrishnan, R. & Gehrke, J. (2003). Database Management Systems, 3rd ed. Chapters 12–15 cover indexing and external sorting.

Production system docs

SQLite — Query Planner Overview. https://www.sqlite.org/queryplanner.html The Next-Generation Query Planner doc (https://www.sqlite.org/queryplanner-ng.html) describes the "N Nearest Neighbors" (N3) path-search heuristic SQLite uses. A clean read on rule vs cost planning in a real engine.
PostgreSQL — Planner/Optimizer. https://www.postgresql.org/docs/current/planner-optimizer.html Authoritative on cost constants, statistics (pg_statistic), and the GEQO genetic optimiser for large join graphs.

Source code

SQLite where.c — single-file implementation of the planner. ~10k LoC of cost-based reasoning over WHERE clauses. The reference.
LevelDB db/version_set.cc — for a non-SQL planner-style scoring function on file-picking in compaction.
CockroachDB pkg/sql/opt/ — Cascades-style optimiser in Go.

Analysis

Goal

Build the smallest end-to-end query engine that nonetheless exercises the three concepts a real planner must get right:

Access-path selection — choose between SeqScan and IndexScan.
Predicate ordering — apply the most selective predicate first.
Projection placement — only carry the columns the query asked for.

All three must be deterministic across Rust, Go, and C++, because the artifact under test is the SHA-256 of the serialised plan + result bytes.

Scope

Pure in-memory. No SQL parser (queries are constructed structurally). No persistence. No transactions. One table, fixed three-column schema (id INT, name TEXT, age INT). Three scenarios — scanonly, mixed (index on age), indexheavy (indexes on age and name).

Design Decisions

Index physical form

A BTreeMap<Value, Vec<row_id>> per indexed column. Three reasons:

Sorted iteration is required for range scans.
The total order on keys is also a stable iteration order for the cross-language test; map randomisation (Go's default) would break that.
Each bucket's Vec<row_id> is naturally ascending because rows are appended in row_id order, so no per-bucket sort is needed.

Planner cost model

Deliberately simple, frozen across languages:

Predicate	Estimate
`Eq` indexable	`rows / distinct_keys`
`Lt/Le/Gt/Ge` ix	`(rows + 2) / 3`
`Ne`	not indexable → SeqScan
No matching index	not indexable → SeqScan

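The (rows+2)/3 form is the classic one-third default selectivity for inequalities when no histogram is available, rounded up in integer arithmetic: it equals ceil(rows/3), so 500 rows give 167. The one-third default goes back to System R, and PostgreSQL's DEFAULT_INEQ_SEL uses the same value. rows/distinct for equality is the expected bucket size when values are spread uniformly over the distinct keys.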
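
The table can be sketched as one small function. This is an illustration under assumptions, not the lab's code: the signature, the use of Option<u64> for "this column has an index with this many distinct keys", and the handling of an empty index are all hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Op {
    Eq,
    Ne,
    Lt,
    Le,
    Gt,
    Ge,
}

/// Returns the estimated row count for an indexable predicate, or None
/// when the planner has to fall back to SeqScan.
/// `distinct_keys` is None when the column has no index (assumed encoding).
fn estimate(op: Op, rows: u64, distinct_keys: Option<u64>) -> Option<u64> {
    let distinct = distinct_keys?; // no matching index: not indexable
    match op {
        Op::Ne => None,
        // Assumption: an empty index estimates zero rows instead of dividing by zero.
        Op::Eq => Some(if distinct == 0 { 0 } else { rows / distinct }),
        // Integer form of ceil(rows / 3).
        Op::Lt | Op::Le | Op::Gt | Op::Ge => Some((rows + 2) / 3),
    }
}

fn main() {
    println!("{:?}", estimate(Op::Eq, 500, Some(100)));
    println!("{:?}", estimate(Op::Gt, 500, Some(100)));
    println!("{:?}", estimate(Op::Ne, 500, Some(100)));
}
```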

Tie-breaking

If two predicates produce the same estimate, the earlier one wins. This makes the choice deterministic without dragging in input-order-sensitive hashing.
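
A minimal sketch of the earlier-wins rule, assuming each predicate's estimate has already been computed and that None marks a non-indexable predicate. The function name choose is hypothetical.

```rust
/// Returns the position of the chosen predicate, or None if no predicate
/// is indexable. The comparison is strict, so on a tie the earlier
/// predicate is kept.
fn choose(estimates: &[Option<u64>]) -> Option<usize> {
    let mut best: Option<(usize, u64)> = None;
    for (i, est) in estimates.iter().enumerate() {
        if let Some(e) = *est {
            let better = match best {
                None => true,
                Some((_, current)) => e < current, // strict <, never <=
            };
            if better {
                best = Some((i, e));
            }
        }
    }
    best.map(|(i, _)| i)
}

fn main() {
    // Two predicates tie at 5; the first one is chosen.
    println!("{:?}", choose(&[Some(5), Some(5)]));
    // The non-indexable predicate is skipped.
    println!("{:?}", choose(&[None, Some(9), Some(3)]));
}
```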

Plan shape

A Plan is always a single Pipeline. Children are, in order: one scan, zero or more Filters (the non-chosen predicates), and an optional Project. No nested pipelines, no joins. Keeps the wire format flat.

Executor model

Volcano-style pull, but materialised at each operator. With at most a few thousand rows per query, the simplicity of materialisation is worth more than the cost of allocation, and it makes the row-emission order trivial to reason about. The "true" pull pipeline is the same code in a streaming shape — the lab doesn't need that subtlety.
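
A sketch of what "materialised at each operator" means in practice. The Operator enum here is a simplified stand-in (integer-only rows, one hard-coded predicate shape), not the lab's plan type: each operator consumes the full Vec produced by the previous one and returns a new Vec.

```rust
type Row = Vec<i64>;

// Simplified, hypothetical operator set for illustration only.
enum Operator {
    SeqScan,
    Filter { col: usize, min_exclusive: i64 },
    Project { cols: Vec<usize> },
}

/// Runs the pipeline left to right, fully materialising after every operator.
fn execute(table: &[Row], pipeline: &[Operator]) -> Vec<Row> {
    let mut rows: Vec<Row> = Vec::new();
    for op in pipeline {
        rows = match op {
            Operator::SeqScan => table.to_vec(),
            Operator::Filter { col, min_exclusive } => rows
                .into_iter()
                .filter(|r| r[*col] > *min_exclusive)
                .collect(),
            Operator::Project { cols } => rows
                .iter()
                .map(|r| cols.iter().map(|&c| r[c]).collect::<Row>())
                .collect(),
        };
    }
    rows
}

fn main() {
    let table = vec![vec![1, 10], vec![2, 20], vec![3, 30]];
    let pipeline = vec![
        Operator::SeqScan,
        Operator::Filter { col: 1, min_exclusive: 10 },
        Operator::Project { cols: vec![0] },
    ];
    println!("{:?}", execute(&table, &pipeline));
}
```

Because every stage preserves the order of its input, the emission order of the whole pipeline is simply the scan order with rows removed.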

Failure Modes Considered

Map randomisation breaks byte identity. Go's default map has a randomised iteration order. We use a parallel sorted slice; explicit sort.Slice is used everywhere a map could leak.
i64 / u32 endianness. Always little-endian, encoded with explicit byte slicing — never unsafe casts.
String collation. Text values are compared as raw bytes (std::memcmp in C++, bytes.Compare in Go, the byte-slice Ord in Rust), never via locale-aware comparison.
Wrong magic length. The result-row magic is exactly 7 bytes DSEQR01 (no NUL terminator). C++ uses std::memcmp(..., 7), never strcmp.

Execution

Tasks Performed

Schema + Value + Row in all three languages. Value is a tagged union of Int(i64) and Text(Vec<u8>) with a frozen total order (Int < Text cross-kind; natural order within kind).
Secondary index as BTreeMap-equivalent: std::collections::BTreeMap in Rust, std::map in C++, a parallel sorted slice in Go (since Go's maps are randomised).
Planner with the cost model from analysis.md. Single pass over predicates, lowest estimate wins.
Executor that scans, filters, projects in order.
Wire format (dump_plan, dump_result) using only little-endian primitives so the SHA-256 lines up across all three implementations.
Workload driver (qplan workload ...) that prints sha256_hex(concat(dump_plan(plan) ++ dump_result(rows))) over N ops.
Tests: 10 Rust + 11 Go + 11 C++ unit tests covering the eight planner behaviours, plus dump determinism and SHA-256 known-answer vectors.
Scripts: verify.sh (build + unit tests), cross_test.sh (build all three binaries, run scenarios A and B, assert sha256 identity and match the frozen golden hashes).

Order of Implementation

Rust first (the lab's reference language). Go next, debugged against the Rust hashes. C++ last, debugged against both. Each language is self-contained — no shared library, no FFI — so a divergence shows up immediately as a different cross_test.sh hash.

Pitfalls Encountered

Go map iteration. The first Go prototype iterated map[Value][]uint32 directly and produced a different hash on every run. Replaced with a sorted []indexEntry slice and a findKey binary search.
C++ std::map<Value, ...>. Works only if Value::operator< is a strict weak order across kinds; the cross-kind Int < Text rule had to match Rust's PartialOrd derivation.
Result magic length. The lab spec freezes the magic at 7 bytes (DSEQR01, no terminator). An early C++ port wrote 8 bytes (NUL included) and the cross-test hash diverged at byte 8 of every op. Discovered by diffing the first 16 bytes of each binary's output for op 0.
u8 op byte for Pred. Rust's enum Op { Eq=1, ... } is #[repr(u8)]; Go and C++ mirror the constants explicitly. A missing #[repr(u8)] was the second source of byte divergence in the first iteration.

What's NOT Implemented

Joins of any kind.
Cost-based search over plan equivalence classes (Cascades).
Histograms / cardinality estimation beyond uniform-distribution.
Index updates on DELETE (rows are append-only).
Index merge (combining two IndexScans on different columns).
Persistence. db-15 is also purely in-memory; on-disk storage is deferred to db-21 and the capstone (db-23).

Observation

Frozen golden hashes

Both scenarios produce SHA-256 hashes that are byte-identical across the Rust, Go, and C++ implementations. These are burned into scripts/cross_test.sh and, for scenario A, into the Go and C++ test suites (test 11).

ID	Args	sha256
A	`--seed 42 --ops 200 --rows 500 --scenario mixed`	`3918bc6eca225f1c9c004fdcefa6551788282a4a2223fa98b002e8b54eb74a2e`
B	`--seed 7 --ops 500 --rows 2000 --scenario indexheavy`	`9313fe694db38912a814abc16600d82f82ead7fc053e813af4bb3978c8fa9abd`

If either hash changes, the wire format has drifted — CONCEPTS.md section 9 and all three test suites must be updated in lockstep.

Byte walkthrough — first op of scenario A

Scenario A drives 200 ops against a 500-row table with an index on column 2 (age). The first op draws (r3, r4, r5) from SplitMix64(42), which gives kind = (r3 >> 60) & 3 = 0, so the op is an EQ on col = r4 % 3 with value pick_val_for(col, r5, 500).

Concretely the first op produces a Plan of:

Pipeline                       0x05 0x01 0x00 0x00 0x00     // 1 child
  IndexScan col=2 op=Eq Int(v) 0x02 0x02 0x00 0x00 0x00     // col_idx=2
                               0x01                          // Op::Eq
                               0x01                          // VTAG_INT
                               <i64 LE v>                    // 8 bytes

That is 20 bytes total for the plan dump (5 + 5 + 1 + 1 + 8). The result row stream is:

"DSEQR01"                  0x44 0x53 0x45 0x51 0x52 0x30 0x31
u32 LE row_count           ....
per row: u32 col_count=3 | (tag|val) * 3

The 7-byte magic is deliberate — the length is part of the byte-identity contract.
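
The walkthrough can be reproduced with a few lines of Rust. This is a sketch built only from the bytes listed above: the constant and function names are hypothetical, the value 37 is an arbitrary stand-in for v, and only the single-IndexScan, Int-valued case is covered. The fields add up to 5 + 5 + 1 + 1 + 8 = 20 bytes.

```rust
const TAG_INDEX_SCAN: u8 = 0x02;
const TAG_PIPELINE: u8 = 0x05;
const OP_EQ: u8 = 0x01;
const VTAG_INT: u8 = 0x01;
const RESULT_MAGIC: &[u8; 7] = b"DSEQR01"; // 7 bytes, no NUL terminator

/// Plan dump for a Pipeline with a single IndexScan child on an Int value.
fn dump_index_scan_plan(col_idx: u32, op: u8, v: i64) -> Vec<u8> {
    let mut out = Vec::with_capacity(20);
    out.push(TAG_PIPELINE);
    out.extend_from_slice(&1u32.to_le_bytes()); // child count
    out.push(TAG_INDEX_SCAN);
    out.extend_from_slice(&col_idx.to_le_bytes());
    out.push(op);
    out.push(VTAG_INT);
    out.extend_from_slice(&v.to_le_bytes());
    out
}

/// Start of the result stream: the magic followed by the u32 LE row count.
fn result_header(row_count: u32) -> Vec<u8> {
    let mut out = Vec::with_capacity(11);
    out.extend_from_slice(RESULT_MAGIC);
    out.extend_from_slice(&row_count.to_le_bytes());
    out
}

fn main() {
    let plan = dump_index_scan_plan(2, OP_EQ, 37);
    println!("plan: {} bytes {:02x?}", plan.len(), plan);
    println!("result header: {:02x?}", result_header(3));
}
```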

Per-scenario telemetry

scenario A — mixed, 200 ops, 500 rows, 1 index (col=age)
  plan-kind distribution (theoretical, from (r3>>60)&3):
    EQ      ~50 %     (kinds 0 and 1 both map to EQ)
    range   ~25 %
    project ~25 %
  → planner emits IndexScan when the op carries a predicate, col == 2 and
    op != Ne (≈ 1/3 of predicates, ≈ 25 % of ops); otherwise SeqScan,
    plus a Filter when there is a predicate.

scenario B — indexheavy, 500 ops, 2000 rows, 2 indexes (col 1 and col 2)
  index coverage rises from ~33 % of predicates → ~66 %; the planner picks
  IndexScan on the smaller-bucket column ~3 × more often, dropping the
  total emitted-row count by an order of magnitude versus scanonly.

Unit test counts

rust  10 tests   cargo test --release --quiet
go    11 tests   go test ./...
cpp   11 tests   ctest --output-on-failure

All three suites cover the same eight planner behaviours plus dump determinism and SHA-256 known-answer vectors. Test 11 in Go and C++ anchors scenario A's hash directly so any drift fails at go test / ctest time, not just at cross_test.sh time.

Verification

What we verify

Single-language correctness. Each language has a test suite that covers the eight planner behaviours (insert layout, index bucket ordering, EQ → IndexScan, range → IndexScan, NE → SeqScan+Filter, projection-only collapse, deterministic row emission order, two-predicate selection of the most selective).
Determinism within a language. run_workload(cfg) is pure; the same cfg produces identical bytes (test 10 in each suite).
Cross-language byte identity. cross_test.sh builds all three qplan binaries, runs scenarios A and B, asserts SHA-256 equality across the three outputs and equality with the frozen reference hashes (3918bc6e… and 9313fe69…).
SHA-256 implementation correctness. Rust and C++ ship their own SHA-256; the empty-string and "abc" known-answer vectors are checked in each unit-test suite. Go uses stdlib crypto/sha256.

How to run

bash scripts/verify.sh       # → "=== OK ==="
bash scripts/cross_test.sh   # → "=== ALL OK ==="

verify.sh runs cargo test, go test, and ctest in turn; the script runs under set -euo pipefail, so any failure aborts it. cross_test.sh exits 1 on the first mismatch or drift from the frozen golden hash.

Hand-checks before changing wire format

If dump_plan or dump_result is touched intentionally, the workflow is:

Update both functions in all three languages in the same commit.
Run cross_test.sh — the outputs across languages must still match.
Capture the new SHA-256 for scenarios A and B.
Update the want_hash A / want_hash B lines in cross_test.sh.
Update the test-11 anchor strings in src/go/idx14_test.go and src/cpp/tests/test_idx14.cc.
Update the hash table in docs/observation.md and the wire-format section in CONCEPTS.md.

Skipping any step makes a future "did the wire format silently drift?" audit unreliable.

What we deliberately do NOT verify

Performance. db-22 owns benchmarking; this lab targets correctness only.
Concurrency. The structures are not thread-safe by design.
Large inputs. Scenarios A and B are sized so cross_test.sh finishes in well under a second on a laptop; the byte-identity property is size-invariant.

Broader Ideas

Beyond this lab

Cost-based optimisation

The cost model here estimates a single number per predicate. A real optimiser must:

Estimate cardinality and CPU/IO cost per operator.
Compose costs across operators (a Filter after an IndexScan costs scan-rows × predicate-eval-cost).
Handle join ordering (System R / Selinger DP).
Handle correlated predicates (x = 1 AND y = 1 where x and y are correlated — uniform-independence is the standard wrong assumption).

See Leis et al. 2015 — bad cardinality estimates dominate bad cost models.

Cascades / Volcano-style search

Move from a single-pass rule-based planner to a search over plan-space:

Represent equivalent plan trees with a memo table.
Apply transformation rules (push a filter below a join, merge adjacent filters).
Score each candidate; pick lowest-cost.
This is what SQL Server, CockroachDB, Apache Calcite do.

Index variants

Hash index — O(1) point lookup, no range.
Bitmap index — efficient AND/OR of many low-cardinality predicates; great for analytics, bad for OLTP.
Covering index — include extra columns to skip the heap read entirely (index-only scan).
Partial index — WHERE x > 100 predicate baked into the index, smaller but only usable when the predicate is implied by the query.
Functional index — index on lower(name) rather than name.
Multi-column index — order matters; left-most prefix rule.

Statistics

A real optimiser maintains:

Histograms (equi-depth, equi-width, or compressed) per column for range selectivity.
NDV (number of distinct values) per column for equality selectivity.
Correlation metrics between column pairs.
MCVs (most common values) for skewed distributions.

These need to be refreshed periodically — ANALYZE in PostgreSQL, sqlite_stat1 table in SQLite.

Adaptive query execution

Spark / Snowflake re-plan at operator boundaries based on observed row counts.
PostgreSQL has parallel-plan re-decisions; Vertica re-optimises mid-query.

Hardware angles

Pointer chasing through an in-memory B-tree is bound by L2 cache misses. Cache-oblivious B-trees and trie-based indexes (ART, HOT) reduce that.
SSDs make sequential scan competitive with random index reads up to surprisingly high selectivity (~10 % of the table).
GPUs and vector instructions favour columnar storage + vectorised scans over row-at-a-time indexing.

What I'd build next

Add a second index type, a hash index, and let the planner compare estimates across index families (O(1) hash beats O(log N) B-tree on pure equality, ties broken by index size).
Add a nestedloop_join operator and extend the cost model so the planner picks the outer vs inner side.
Add a tiny ANALYZE step that snapshots distinct_keys and lets Planner::estimate consult cached stats instead of walking the index each call.

Step 01 — Secondary Index

Goal

Build the Table structure: rows, schema, and a BTreeMap-equivalent secondary index per indexed column.

Why

A secondary index is the smallest, most universal unit of query optimisation. Once you can map column-value → row-ids in sorted order, equality and range predicates become dictionary operations, and the planner has a real choice to make.

What to do

Pick a tagged-union Value type: Int(i64) or Text(bytes). Implement a frozen total order: Int < Text across kinds; natural order within each kind. This is the byte-identity contract.
Define Column { name, type } and Row = Vec<Value>.
Implement Table::new(schema), Table::insert(row), Table::create_index(col_idx). The index is keyed by Value and maps to an ascending Vec<row_id>.
Decide the physical form per language:
- Rust: BTreeMap<Value, Vec<u32>>.
- C++: std::map<Value, std::vector<std::uint32_t>>.
- Go: sorted slice of (Value, []uint32) + binary search. Never iterate a Go map for cross-language tests; the order is randomised.
insert must update every existing index before moving the row. create_index must iterate rows in order (so each bucket's row-id list is naturally ascending without per-bucket sort).
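
The maintenance rule in the last item can be sketched as follows. This is an illustration, not the lab's Table: the schema is omitted, the indexes field layout is an assumption, and row ids are simply positions in the rows vector.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
enum Value {
    Int(i64),
    Text(Vec<u8>),
}

type Row = Vec<Value>;

#[derive(Default)]
struct Table {
    rows: Vec<Row>,
    // col_idx -> (key -> ascending row ids); assumed layout
    indexes: BTreeMap<usize, BTreeMap<Value, Vec<u32>>>,
}

impl Table {
    /// Updates every existing index first, then moves the row into the heap.
    fn insert(&mut self, row: Row) -> u32 {
        let row_id = self.rows.len() as u32;
        for (col_idx, index) in self.indexes.iter_mut() {
            // Clone the key: the row itself is moved into `rows` below.
            index.entry(row[*col_idx].clone()).or_default().push(row_id);
        }
        self.rows.push(row);
        row_id
    }

    /// Walks rows in row-id order, so each bucket is ascending without a sort.
    fn create_index(&mut self, col_idx: usize) {
        let mut index: BTreeMap<Value, Vec<u32>> = BTreeMap::new();
        for (row_id, row) in self.rows.iter().enumerate() {
            index.entry(row[col_idx].clone()).or_default().push(row_id as u32);
        }
        self.indexes.insert(col_idx, index);
    }
}

fn main() {
    let mut t = Table::default();
    t.insert(vec![Value::Int(0), Value::Text(b"n1".to_vec()), Value::Int(30)]);
    t.insert(vec![Value::Int(1), Value::Text(b"n2".to_vec()), Value::Int(25)]);
    t.create_index(2);
    t.insert(vec![Value::Int(2), Value::Text(b"n3".to_vec()), Value::Int(30)]);
    println!("{:?}", t.indexes[&2usize]);
}
```

Rows inserted before and after create_index end up in the same bucket in ascending row-id order, which is the property the acceptance criteria rely on.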

Acceptance

Inserting the same rows in the same order produces the same index contents byte-for-byte across languages.
An EQ lookup on an indexed column returns rows in ascending row-id order.
A range lookup walks the index in ascending key order, with ascending row-id order within each bucket.

Common pitfalls

Storing the wrong column in the bucket (off-by-one on col_idx).
Copying the Value lazily and then losing the bytes when the row moves — clone before insert.
Letting Go's range over map leak into any code path that touches the index — every iteration must be over the sorted slice.

Step 02 — Rule-Based Planner + Executor

Goal

Turn a Query { projections, predicates } into a Plan { Pipeline[scan, *Filter, Project?] }, then execute it.

Why

Once a table has indexes, every predicate has at least two physical implementations: scan the whole heap and filter, or jump into the index. Picking the right one is what a planner does. Even a 10-line rule-based planner outperforms "always SeqScan" by orders of magnitude on selective queries, and it teaches the same vocabulary (selectivity, access path, predicate pushdown) you need for cost-based work later.

What to do

Estimate per predicate (return None if not indexable):
- Op::Ne → not indexable.
- No index on the column → not indexable.
- Op::Eq → rows / distinct_keys.
- Op::Lt / Le / Gt / Ge → (rows + 2) / 3.
Pick the lowest estimate (strict <, so earlier predicate wins ties — determinism matters).
Build the Plan:
- First child: IndexScan{col, op, val} if a predicate was chosen, else SeqScan{table_id: 0}.
- Remaining predicates become Filter nodes in input order.
- If projections is non-empty, append Project{cols} — sort and dedupe the column list so the wire format is canonical.
Execute the Plan (Volcano-style, but materialise at each operator for simplicity):
- Scan emits all matching rows in (key, row-id) order.
- Filter retains rows where eval_pred(row, predicate) is true.
- Project rewrites each row to the requested column subset.
IndexScan for each op:
- Eq → fetch the single bucket (or empty).
- Lt → iterate range(..val).
- Le → iterate range(..=val).
- Gt → iterate the range that starts strictly after val (an excluded lower bound; val+ε is notation, not arithmetic).
- Ge → iterate range(val..).
- Within each bucket, emit row-ids in ascending order.
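
The five bounds map directly onto std::ops::Bound pairs passed to BTreeMap::range. A sketch with i64 keys for brevity (the lab keys the index by Value); the function name index_scan is hypothetical.

```rust
use std::collections::BTreeMap;
use std::ops::Bound::{Excluded, Included, Unbounded};

#[derive(Debug, Clone, Copy)]
enum Op {
    Eq,
    Lt,
    Le,
    Gt,
    Ge,
}

/// Returns row ids in (key, row-id) order for the given predicate.
fn index_scan(index: &BTreeMap<i64, Vec<u32>>, op: Op, val: i64) -> Vec<u32> {
    let bounds = match op {
        Op::Eq => (Included(val), Included(val)),
        Op::Lt => (Unbounded, Excluded(val)),
        Op::Le => (Unbounded, Included(val)),
        Op::Gt => (Excluded(val), Unbounded), // the "val + ε" case
        Op::Ge => (Included(val), Unbounded),
    };
    index
        .range(bounds)
        .flat_map(|(_, ids)| ids.iter().copied())
        .collect()
}

fn main() {
    let mut index = BTreeMap::new();
    index.insert(10, vec![0, 3]);
    index.insert(20, vec![1]);
    index.insert(30, vec![2, 4]);
    for op in [Op::Eq, Op::Lt, Op::Le, Op::Gt, Op::Ge] {
        println!("{:?} 20 -> {:?}", op, index_scan(&index, op, 20));
    }
}
```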

Acceptance

Given an indexed column and an Op::Eq predicate, the planner emits IndexScan, and the executor returns the matching rows in row-id order.
Given a non-indexable predicate (Op::Ne, or no index), the planner emits SeqScan and a Filter.
Given two predicates, the planner picks the one with the lower estimate; the other becomes a Filter.
Projections with duplicates (e.g. [2, 0, 2]) end up as [0, 2].
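
The last criterion is small enough to show in full. A sketch assuming projections are column indexes stored as u32; both function names are hypothetical. The input is cloned rather than sorted in place, so the caller's Query is left untouched.

```rust
/// Returns the canonical (sorted, de-duplicated) column list for a Project node.
fn canonical_projection(projections: &[u32]) -> Vec<u32> {
    let mut cols = projections.to_vec(); // clone: never mutate the query
    cols.sort_unstable();
    cols.dedup(); // removes adjacent duplicates, so it must follow the sort
    cols
}

/// Rewrites one row to the requested column subset.
fn project(row: &[i64], cols: &[u32]) -> Vec<i64> {
    cols.iter().map(|&c| row[c as usize]).collect()
}

fn main() {
    let cols = canonical_projection(&[2, 0, 2]);
    println!("{:?}", cols);
    println!("{:?}", project(&[7, 8, 9], &cols));
}
```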

Common pitfalls

Forgetting to clone the predicate value when moving it into IndexScan — both the chosen and discarded predicates need a copy.
Using <= instead of < for tie-breaking: with <= the later predicate wins a tie, which contradicts the frozen earlier-wins rule and changes the plan bytes.
Stopping a Le (<=) scan at the first key equal to val instead of the first key strictly greater than val, which drops the val bucket (off-by-one on bounds).
Mutating the input Query.projections instead of cloning before sort / dedup.

Step 03 — Cross-Language Byte Identity

Goal

Make Rust, Go, and C++ produce byte-identical dump_plan(plan) ++ dump_result(rows) streams for the same workload, and verify it with SHA-256.

Why

If three implementations of the same spec produce the same bytes on a randomly-seeded workload, they agree on every observable behaviour — plan choice, operator order, row emission order, value encoding. A single divergent byte is the difference between "we have a spec" and "we have three programs that happen to look similar".

What to do

Freeze the wire format in CONCEPTS.md section 9. Plan tags 0x01..0x05, op codes 0x01..0x06, val tags 0x01..0x02, result magic "DSEQR01" (7 bytes, no terminator).
Implement dump_plan / dump_result in each language. Use only little-endian primitives — never platform-native byte order. C++: std::memcpy of to_le_bytes-equivalent expressions; never reinterpret int*. Go: binary.LittleEndian.PutUint32 / PutUint64. Rust: to_le_bytes().
Implement RunWorkload identically:
- SplitMix64(seed) with the canonical constants 0x9E3779B97F4A7C15, 0xBF58476D1CE4E7B5, 0x94D049BB133111EB.
- For each row i in 0..rows: draw r1, r2; insert (IntV(i), Text("n" + (r1 % 1000)), IntV(r2 % 100)).
- After insertion, apply the scenario's indexes (none / col 2 / cols 2+1).
- For each op: draw r3, r4, r5; derive kind = (r3 >> 60) & 3, col = r4 % 3. Build the query per kind (0/1 → EQ, 2 → range with op = ((r5>>1)&1) ? Lt : Gt, 3 → projection-only).
- Plan, execute, append dump_plan ++ dump_result to the rolling output.
CLI: qplan workload --seed S --ops N --rows R --scenario X prints sha256_hex of the rolling output with no trailing newline.
Compare: scripts/cross_test.sh runs both scenarios across all three binaries and asserts the three hashes match each other and the frozen golden hashes.
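
For reference, a sketch of the generator and the per-op draw described above. The three constants and the bit arithmetic come from the text; the struct, the method name next_u64 and the helper draw_op are hypothetical, and query construction from (kind, col, r5) is left out.

```rust
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        SplitMix64 { state: seed }
    }

    /// One step of splitmix64; all arithmetic wraps modulo 2^64.
    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E7B5);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Draws r3, r4, r5 and derives (kind, col), returning r5 for value picking.
fn draw_op(rng: &mut SplitMix64) -> (u64, u64, u64) {
    let r3 = rng.next_u64();
    let r4 = rng.next_u64();
    let r5 = rng.next_u64();
    let kind = (r3 >> 60) & 3; // 0/1 -> EQ, 2 -> range, 3 -> projection-only
    let col = r4 % 3;
    (kind, col, r5)
}

fn main() {
    let mut rng = SplitMix64::new(42);
    for _ in 0..3 {
        let (kind, col, r5) = draw_op(&mut rng);
        println!("kind={} col={} r5={:#018x}", kind, col, r5);
    }
}
```

Every op draws exactly three words even when it does not use all of them, which keeps the stream aligned across implementations.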

Acceptance

scripts/verify.sh ends with === OK === (unit tests pass in all three languages).
scripts/cross_test.sh ends with === ALL OK === (cross-language bytes match; golden hashes match).
Anchor tests (test 11 in Go and C++) verify scenario A's SHA-256 at unit-test time, so drift is caught even without running cross_test.sh.

Common pitfalls

Trailing newline from println! / fmt.Println / std::cout << std::endl will change the binary's stdout. Use write_all / Write / fwrite and flush.
Magic length. Writing "DSEQR01\0" (8 bytes) instead of 7 makes every op-boundary off by one. The byte-walkthrough in docs/observation.md is the canonical reference if in doubt.
Map iteration order in Go. Use sorted slices for any structure whose iteration order ends up in the wire bytes.
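#[repr(u8)] and explicit discriminants on Rust enums. Without explicit values (Eq = 1, ...), variants are numbered from 0, so op as u8 does not equal the constants 1..6; #[repr(u8)] additionally pins the in-memory representation to one byte for any code that reads the discriminant without a cast.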
bool packing. Some C++ standard-library std::vector<bool> paths are surprising; never put a bool in the wire format — promote to std::uint8_t.
SHA-256 final byte ordering. The output is big-endian per word; hex-encoding mistakes swap nibbles. The empty-string known answer (e3b0c442...) catches this immediately.
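
Two of these pitfalls (nibble order and the trailing newline) can be avoided with a few lines. A sketch, assuming the digest is already available as a byte slice; to_hex is a hypothetical helper and the four bytes in main are just the first bytes of the empty-string digest quoted above.

```rust
use std::io::{self, Write};

/// Lower-case hex, high nibble first.
fn to_hex(digest: &[u8]) -> String {
    const HEX: &[u8; 16] = b"0123456789abcdef";
    let mut s = String::with_capacity(digest.len() * 2);
    for &b in digest {
        s.push(HEX[(b >> 4) as usize] as char); // high nibble
        s.push(HEX[(b & 0x0f) as usize] as char); // low nibble
    }
    s
}

fn main() -> io::Result<()> {
    let digest_prefix = [0xe3u8, 0xb0, 0xc4, 0x42];
    let stdout = io::stdout();
    let mut out = stdout.lock();
    // write_all emits exactly these bytes: no newline is appended.
    out.write_all(to_hex(&digest_prefix).as_bytes())?;
    out.flush()
}
```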

db-15 — SQLite-shaped engine, end-to-end

Where this sits

This lab is the capstone for the SQLite-style track. Earlier labs (db-10 .. db-14) build the parts in isolation: B-tree (db-10), pager (db-11), SQL frontend (db-12), MVCC transactions (db-13), indexes (db-14). Here we fuse a deliberately small slice of all of them into one engine and prove the slice is reproducible across Rust, Go, and C++ down to the byte.

The goal is not feature parity with real SQLite — that would dwarf the lab. The goal is to exhibit, in code small enough to keep in your head, the join between:

a primary index keyed by integer,
a secondary index keyed by text,
an MVCC tombstone scheme governed by a monotonic transaction id,
a deterministic snapshot wire format that any of the three reference implementations can produce identically.

Data model

A single table:

kv(k INT primary key, v INT, tag TEXT)

Physical row:

Row { v: i64, tag: String, created_at: u64, deleted_at: u64 }

deleted_at == 0 means the row is live; anything else is the txid at which it was tombstoned. Tombstoned rows stay in the primary map (they appear in the snapshot dump so a verifier can audit historical state) but they disappear from the secondary index immediately on delete.

In-memory layout:

primary: ordered map<i64 -> Row> — sorted by key. Holds tombstones.
secondary: ordered map<String -> sorted Vec<i64>> — live rows only. Each list is kept strictly ascending.

A single monotonically-increasing next_txid (starts at 1) governs visibility. Read-only SELECT never bumps it. Write ops bump only when they actually mutate state.

SQL surface

Only four ops, deliberately:

op	semantics	txid bump?
`INSERT(k, v, tag)`	UPSERT — replaces an existing row (even a tombstoned one) with a fresh row whose `created_at` is the new txid.	always
`UPDATE(k, v, tag)`	live-only. If the row is missing or tombstoned, returns false and does not bump txid. Keeps original `created_at`. Maintains the secondary index across tag changes.	only if work was done
`DELETE(k)`	live-only. Marks `deleted_at = txid`, removes the row from the secondary index.	only if work was done
`SELECT_BY_K(k)` / `SELECT_BY_TAG(tag)`	read-only.	never

The semantic gotcha for cross-language identity is the no-op rule on UPDATE and DELETE. If any implementation bumps txid on a missing key, every subsequent created_at / deleted_at will drift and the snapshot diverges.
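
A compact sketch of the three write ops and their txid rules. It is an illustration, not the lab's source: the Engine layout and helper names are hypothetical, and two details are assumptions that the text does not pin down. First, an op takes the current next_txid as its txid and then increments it. Second, a secondary bucket is removed once it becomes empty.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
struct Row {
    v: i64,
    tag: String,
    created_at: u64,
    deleted_at: u64, // 0 means live
}

struct Engine {
    next_txid: u64,
    primary: BTreeMap<i64, Row>,
    secondary: BTreeMap<String, Vec<i64>>,
}

impl Engine {
    fn new() -> Self {
        Engine { next_txid: 1, primary: BTreeMap::new(), secondary: BTreeMap::new() }
    }

    /// Tag of the row at `k` if it exists and is live.
    fn live_tag(&self, k: i64) -> Option<String> {
        self.primary.get(&k).filter(|r| r.deleted_at == 0).map(|r| r.tag.clone())
    }

    fn sec_add(&mut self, tag: &str, k: i64) {
        let keys = self.secondary.entry(tag.to_string()).or_default();
        if let Err(pos) = keys.binary_search(&k) {
            keys.insert(pos, k); // keeps the list strictly ascending
        }
    }

    fn sec_remove(&mut self, tag: &str, k: i64) {
        let emptied = match self.secondary.get_mut(tag) {
            Some(keys) => {
                if let Ok(pos) = keys.binary_search(&k) {
                    keys.remove(pos);
                }
                keys.is_empty()
            }
            None => false,
        };
        if emptied {
            self.secondary.remove(tag); // assumption: empty buckets are dropped
        }
    }

    fn insert(&mut self, k: i64, v: i64, tag: &str) {
        let txid = self.next_txid;
        self.next_txid += 1; // INSERT always bumps
        if let Some(old) = self.live_tag(k) {
            self.sec_remove(&old, k);
        }
        self.primary.insert(k, Row { v, tag: tag.to_string(), created_at: txid, deleted_at: 0 });
        self.sec_add(tag, k);
    }

    fn update(&mut self, k: i64, v: i64, tag: &str) -> bool {
        let old = match self.live_tag(k) {
            Some(t) => t,
            None => return false, // no-op: txid untouched
        };
        self.next_txid += 1;
        self.sec_remove(&old, k);
        if let Some(row) = self.primary.get_mut(&k) {
            row.v = v;
            row.tag = tag.to_string(); // created_at is kept
        }
        self.sec_add(tag, k);
        true
    }

    fn delete(&mut self, k: i64) -> bool {
        let old = match self.live_tag(k) {
            Some(t) => t,
            None => return false, // no-op: txid untouched
        };
        let txid = self.next_txid;
        self.next_txid += 1;
        self.sec_remove(&old, k);
        if let Some(row) = self.primary.get_mut(&k) {
            row.deleted_at = txid; // tombstone stays in the primary
        }
        true
    }
}

fn main() {
    let mut e = Engine::new();
    e.insert(1, 10, "t3");
    let updated_missing = e.update(2, 0, "t0");
    e.delete(1);
    println!("updated_missing={} next_txid={}", updated_missing, e.next_txid);
}
```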

Snapshot wire format

Magic = "DSESQL15" (8 bytes, ASCII).

magic[8] || next_txid:u64 LE || primary_row_count:u32 LE
for each k in ascending order:
    k:i64 LE
    v:i64 LE
    tag_len:u32 LE
    tag_bytes
    created_at:u64 LE
    deleted_at:u64 LE
secondary_distinct_keys:u32 LE
for each tag in ascending lexicographic order:
    tag_len:u32 LE
    tag_bytes
    key_count:u32 LE
    for each key in ascending order: i64 LE

Three properties this format is built for:

Total order at every level. Both the primary and secondary sections iterate in a sort order that is well-defined regardless of the host hash map (a real bug we hit in early Go drafts: map iteration is randomised, so a for k, v := range without an explicit sort produces a different byte stream on every run).
Tombstones are observable. Including tombstones in the primary dump means the snapshot reflects the visibility scheme, not just the live set — useful when comparing two implementations' MVCC behaviour.
Self-delimiting. Every variable-length string is preceded by its length, so a parser does not have to guess.
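
The layout translates almost line for line into Rust. A sketch that takes the two ordered maps as arguments; the function names are hypothetical, and BTreeMap is used so that both loops iterate in ascending key order without an explicit sort.

```rust
use std::collections::BTreeMap;

struct Row {
    v: i64,
    tag: String,
    created_at: u64,
    deleted_at: u64,
}

/// Length-prefixed string: u32 LE length, then the raw bytes.
fn put_str(out: &mut Vec<u8>, s: &str) {
    out.extend_from_slice(&(s.len() as u32).to_le_bytes());
    out.extend_from_slice(s.as_bytes());
}

fn dump_snapshot(
    next_txid: u64,
    primary: &BTreeMap<i64, Row>,
    secondary: &BTreeMap<String, Vec<i64>>,
) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(b"DSESQL15");
    out.extend_from_slice(&next_txid.to_le_bytes());
    out.extend_from_slice(&(primary.len() as u32).to_le_bytes());
    for (k, row) in primary {
        // Tombstoned rows are written too; deleted_at carries the txid.
        out.extend_from_slice(&k.to_le_bytes());
        out.extend_from_slice(&row.v.to_le_bytes());
        put_str(&mut out, &row.tag);
        out.extend_from_slice(&row.created_at.to_le_bytes());
        out.extend_from_slice(&row.deleted_at.to_le_bytes());
    }
    out.extend_from_slice(&(secondary.len() as u32).to_le_bytes());
    for (tag, keys) in secondary {
        put_str(&mut out, tag);
        out.extend_from_slice(&(keys.len() as u32).to_le_bytes());
        for k in keys {
            out.extend_from_slice(&k.to_le_bytes());
        }
    }
    out
}

fn main() {
    let mut primary = BTreeMap::new();
    primary.insert(5, Row { v: 7, tag: "t3".to_string(), created_at: 1, deleted_at: 0 });
    let mut secondary = BTreeMap::new();
    secondary.insert("t3".to_string(), vec![5]);
    let bytes = dump_snapshot(2, &primary, &secondary);
    println!("{} bytes", bytes.len());
}
```

An empty engine dumps to 24 bytes (8 + 8 + 4 + 4); each row adds 36 + len(tag) bytes and each secondary bucket adds 8 + len(tag) + 8 per key.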

Deterministic workload

run_workload(seed, ops, keys, scenario) is the only entry point used in cross-language testing. It draws three 64-bit words per op from a splitmix64 seeded with seed:

r1, r2, r3 = rng.next(), rng.next(), rng.next()
kind = (r1 >> 60) & 0x7
k    = (i64) (r2 % keys)
v    = (i64) (r3 % 10_000)
tag  = "t" + ((r3 >> 32) % 16)
match kind {
    0,1,2 => INSERT(k, v, tag)   // 3/8 of ops
    3,4   => UPDATE(k, v, tag)   // 2/8
    5     => DELETE(k)           // 1/8
    6     => SELECT_BY_K(k)      // 1/8
    7     => SELECT_BY_TAG(tag)  // 1/8
}

Two non-obvious rules:

Reads still draw all three rng words. Even though SELECT_BY_K only needs k, it still draws r3. Skipping the draw would shift the rng stream for every subsequent op, so the output would no longer match the other implementations or the golden hashes.
tag = "t" + decimal(n). Decimal string formatting, not hex — trivially easy to get wrong in C++ where std::ostringstream << std::hex is the default reflex.
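
The derivation above as compilable Rust. A sketch: the Stmt enum and decode_op are hypothetical names, and the caller is assumed to have drawn r1, r2 and r3 unconditionally before calling, in line with the first rule.

```rust
#[derive(Debug, PartialEq, Eq)]
enum Stmt {
    Insert { k: i64, v: i64, tag: String },
    Update { k: i64, v: i64, tag: String },
    Delete { k: i64 },
    SelectByK { k: i64 },
    SelectByTag { tag: String },
}

/// Maps three already-drawn rng words to one statement.
fn decode_op(r1: u64, r2: u64, r3: u64, keys: u64) -> Stmt {
    let kind = (r1 >> 60) & 0x7;
    let k = (r2 % keys) as i64; // unsigned modulo first, then the cast
    let v = (r3 % 10_000) as i64;
    let tag = format!("t{}", (r3 >> 32) % 16); // base 10, no padding
    match kind {
        0..=2 => Stmt::Insert { k, v, tag },
        3 | 4 => Stmt::Update { k, v, tag },
        5 => Stmt::Delete { k },
        6 => Stmt::SelectByK { k },
        _ => Stmt::SelectByTag { tag }, // kind == 7
    }
}

fn main() {
    println!("{:?}", decode_op(0, 5, 7, 32));
    println!("{:?}", decode_op(5u64 << 60, 33, 0, 32));
    println!("{:?}", decode_op(7u64 << 60, 0, (11u64 << 32) | 123, 32));
}
```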

Frozen golden hashes

Captured from the Rust release build. The cross-language test asserts these byte for byte.

Scenario	CLI args	SHA-256
A	`--seed 42 --ops 500 --keys 32 --scenario default`	`e8ccacd39d8535c1ed101f0bc8b7a0799f56468a384da9284d4768cd8b3a3aab`
B	`--seed 7 --ops 2000 --keys 128 --scenario default`	`dd1d6bb7fec1ffc9f71f01e75a58166b04517a669495af2aa2da432d4722db69`

Sources of cross-language divergence

A non-exhaustive checklist that we hit while building the three ports:

Map iteration order. Go map iteration is randomised. Always collect keys then sort.Strings/sort.Slice before any side-effecting iteration that contributes to the wire stream.
Signed vs unsigned k. r2 % keys is unsigned modular arithmetic in all three languages; we then cast to i64. A cast through int on 32-bit platforms would lose bits. C++ uses static_cast<int64_t>, Rust as i64, Go int64(...).
Tag formatting. Use base-10 only. Padding, hex, or uppercase would all change the bytes silently.
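Splitmix64 constants. All three implementations use the same triple: 0x9E3779B97F4A7C15, 0xBF58476D1CE4E7B5, 0x94D049BB133111EB. In C++, keep the state and every intermediate in std::uint64_t (and suffix the literals with ULL to make the intent explicit): an unsuffixed hex literal already takes the first integer type wide enough to hold it, but assigning or multiplying through a 32-bit type truncates the value and produces a different stream.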
SHA-256. Rust uses sha2, Go uses crypto/sha256, C++ ships an inline reference implementation in src/cpp/src/sql15.cc. A canonical test vector (SHA256("abc")) is asserted in every test suite to catch a broken implementation before it pollutes a scenario hash.
No trailing newline from the CLI. The shell-level test compares the output of "$RUST_BIN ..." with that of "$GO_BIN ..." as raw strings; an extra \n from one of them breaks the equality even though the hashes themselves match. Rust uses print!, Go uses fmt.Print, C++ uses std::cout << ... with no << std::endl.

What this lab does not model

Listed up front so the reader does not look for them:

No on-disk persistence, no WAL, no pager. The "snapshot" is an in-memory byte stream produced on demand.
No concurrent transactions. MVCC visibility rules are implemented, but there is only one writer.
No query planner; SELECT_BY_K and SELECT_BY_TAG are direct map lookups.
No DDL. The schema is hard-coded.

Those are deliberately deferred to db-21 and the capstone (db-23).

References

Books

Sippu, S., & Soisalon-Soininen, E. (2015). Transaction Processing: Management of the Logical Database and its Underlying Physical Structure. Springer. Chapter 6 ("Logical Database Updates") gives the cleanest treatment of the no-op-update / no-op-delete rule that governs txid allocation here.
Bernstein, P. A., Hadzilacos, V., & Goodman, N. (1987). Concurrency Control and Recovery in Database Systems. Addison-Wesley. Chapter 5 on multiversion concurrency control is the source of the "tombstone with deleted-at txid" representation we use.
Hellerstein, J. M., Stonebraker, M., & Hamilton, J. (2007). Architecture of a Database System. Foundations and Trends in Databases, 1(2). Provides the layering vocabulary (storage manager, access methods, query processor) we slice through here.

Papers

Reed, D. P. (1978). Naming and Synchronization in a Decentralized Computer System. MIT/LCS/TR-205. The original MVCC paper.
Bernstein, P. A., & Goodman, N. (1981). Concurrency Control in Distributed Database Systems. ACM Computing Surveys 13(2). Lays out the timestamp-ordering protocols that motivate our monotonic txid.

Source documentation

SQLite VDBE specification — https://sqlite.org/opcode.html. We do not implement a VDBE in db-15, but the opcode list is the canonical decomposition of the four operations this lab supports.
SQLite file format — https://sqlite.org/fileformat.html. The page-level layout we do not model here. Useful contrast for the wire format in CONCEPTS.md.
Standard splitmix64 reference — https://prng.di.unimi.it/splitmix64.c. All three ports use these constants verbatim.

Cross-language byte-identity practice

Google's protobuf canonical encoding spec — https://protobuf.dev/programming-guides/encoding/. The discipline of sorting map entries before serialisation comes from there.
CBOR deterministic encoding (RFC 8949 §4.2). Same idea applied to a different format. Useful background for why we sort the secondary index lexicographically rather than by insertion order.

Earlier labs in this workspace

analysis

The shape of the problem

We want the smallest engine that still demonstrably integrates the five things the SQLite-track labs build separately: a primary keyed container, a secondary index, an MVCC visibility scheme, a SQL surface, and a reproducible on-the-wire snapshot. "Smallest" here means: any feature we cut must be a feature that other labs already cover or labs after this will cover (db-21, db-23).

Three forces pull on the design:

It has to be correct in three languages at once. Cross-language byte identity is the cheap, mechanical proof that the implementations agree. Anything that varies between language runtimes (hash map ordering, string formatting, integer width, signedness on casts) becomes a hazard.
It has to be small enough to keep in your head. The whole engine is ~400 lines per language. That budget forced us to drop the pager, the on-disk format, and any kind of query planner.
It has to actually test the integration. A no-op UPDATE that silently bumps the txid would not be caught by the unit tests in any one language — only the cross-language hash comparison would expose it.

Why MVCC over locking

A locking implementation would have been smaller, but it would not have produced a visible artefact for the snapshot. With MVCC we have the row-level created_at / deleted_at pair as observable state, and the snapshot dump can carry it. That gives us something to compare.

Why a secondary index

Without one, the snapshot would be just a sorted map dump and the cross-language test would degenerate into "do all three languages sort ints the same way" (trivially yes). The secondary forces us to also sort strings deterministically, which is where Go's randomised map iteration would otherwise bite.

Where the test surface actually catches bugs

A pleasant surprise: most of the time the unit tests in any one language pass and only the cross-language script fails. That is diagnostic in itself — it almost always points at either:

a missing sort.Strings / sort.Slice in Go,
a static_cast<int> instead of static_cast<int64_t> in C++,
a SplitMix64 constant such as 0x9E3779B97F4A7C15 stored or multiplied through a type narrower than 64 bits in C++ (the compiler warns about the truncation, but the warning is buried in a thousand-line build log).

The two frozen scenarios are deliberately sized:

Scenario A (--ops 500 --keys 32): small enough to debug by re-running with a smaller op count and diffing the intermediate snapshots.
Scenario B (--ops 2000 --keys 128): large enough to thrash the secondary index and the tombstone code path.

execution

Order of operations we actually used

Pick the reference implementation. Rust first, because the type system catches the easiest mistakes (signed/unsigned, missing match arms) at compile time. Once 13 unit tests pass in Rust, freeze the golden hashes from the release build.
Port to Go. Mirror the structure 1:1. The only language-shaped differences are: an explicit sort.Slice everywhere a Rust BTreeMap iteration is implicit, and fmt.Sprintf("t%d", n) in place of Rust's format!("t{}", n).
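Port to C++. Same structure again. Use std::map instead of std::unordered_map so iteration is sorted-by-key for free. Format the tag in base 10 with locale-independent code: std::to_string on the integer is enough, and a std::ostringstream should be imbued with std::locale::classic() because stream output follows the global locale.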
Write the cross-language script last. Build all three CLI binaries, run both scenarios, assert pairwise equality and equality to the goldens.

Running it

$ cd db-15-sqlite-complete
$ bash scripts/verify.sh
=== Rust ===
... test result: ok. 13 passed; 0 failed
=== Go ===
ok      github.com/10xdev/dse/db15      ...
=== C++ ===
OK 13 tests
=== OK ===

$ bash scripts/cross_test.sh
=== scenario A: --seed 42 --ops 500 --keys 32 ===
  rust=e8ccacd39d8535c1ed101f0bc8b7a0799f56468a384da9284d4768cd8b3a3aab
  go  =e8ccacd39d8535c1ed101f0bc8b7a0799f56468a384da9284d4768cd8b3a3aab
  cpp =e8ccacd39d8535c1ed101f0bc8b7a0799f56468a384da9284d4768cd8b3a3aab
=== scenario B: --seed 7 --ops 2000 --keys 128 ===
  rust=dd1d6bb7fec1ffc9f71f01e75a58166b04517a669495af2aa2da432d4722db69
  go  =dd1d6bb7fec1ffc9f71f01e75a58166b04517a669495af2aa2da432d4722db69
  cpp =dd1d6bb7fec1ffc9f71f01e75a58166b04517a669495af2aa2da432d4722db69
=== ALL OK ===

Things that went wrong in development

First Go run produced a different hash for scenario A. Cause: ranging directly over c.secondary instead of collecting keys and calling sort.Strings. The fix is in src/go/sql15.go; see DumpSnapshot.
First C++ run also diverged. Cause: std::unordered_map instead of std::map. Same fix shape — switch container, or sort keys before iteration. We chose std::map for symmetry with Rust's BTreeMap.
A test asserted SHA256("abc") and failed. Typo in the expected hex string (extra 3, missing trailing d). The canonical value is ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad. Worth pinning a known SHA-256 vector in every cross-language lab.

observation

What we measured

For each implementation, on every commit:

All unit tests pass under release optimisation. (Release-only bugs are real: assert(side_effect) under -DNDEBUG is the classic.)
The CLI binary produces both golden hashes.
The cross-language script produces the same hash from all three binaries.

What the bytes look like

The snapshot from scenario A is 7088 bytes. Roughly:

8 bytes magic
8 bytes next_txid (at most 501 for scenario A: it starts at 1 and no-op operations do not advance it)
4 bytes primary row count (≈ 28 of 32 possible keys are touched)
Per row: 8 + 8 + 4 + len(tag) + 8 + 8 = 36 + len(tag) bytes
4 bytes secondary distinct tag count
Per (tag, keys): 4 + len(tag) + 4 + 8*key_count bytes

The largest single section is the primary; the secondary is small because the tag alphabet is fixed at 16.
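A minimal Rust sketch of the per-row and per-tag encodings implied by the byte counts above. The field order, the little-endian integers, the u32 length prefixes, and the names encode_row and encode_secondary_entry are assumptions for illustration; the authoritative layout is the one in CONCEPTS.md.

```rust
/// Row shape as described in the lab; `deleted_at == 0` means the row is live.
struct Row {
    v: i64,
    tag: String,
    created_at: u64,
    deleted_at: u64,
}

/// Append one primary row: k, v, tag length, tag bytes, created_at, deleted_at.
/// Assumption: all integers are little-endian and the tag length is a u32.
fn encode_row(out: &mut Vec<u8>, k: i64, row: &Row) {
    out.extend_from_slice(&k.to_le_bytes()); // 8
    out.extend_from_slice(&row.v.to_le_bytes()); // 8
    out.extend_from_slice(&(row.tag.len() as u32).to_le_bytes()); // 4
    out.extend_from_slice(row.tag.as_bytes()); // len(tag)
    out.extend_from_slice(&row.created_at.to_le_bytes()); // 8
    out.extend_from_slice(&row.deleted_at.to_le_bytes()); // 8
}

/// Append one secondary entry: tag length, tag bytes, key count, then each key.
fn encode_secondary_entry(out: &mut Vec<u8>, tag: &str, keys: &[i64]) {
    out.extend_from_slice(&(tag.len() as u32).to_le_bytes()); // 4
    out.extend_from_slice(tag.as_bytes()); // len(tag)
    out.extend_from_slice(&(keys.len() as u32).to_le_bytes()); // 4
    for k in keys {
        out.extend_from_slice(&k.to_le_bytes()); // 8 per key
    }
}
```

Under these assumptions a row tagged "t7" encodes to 36 + 2 = 38 bytes, and a secondary entry for "t7" holding two keys encodes to 4 + 2 + 4 + 16 = 26 bytes.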

Visibility of tombstones

Because tombstoned rows stay in the primary, you can read the snapshot and recover the current visible state by filtering on deleted_at == 0. That property let us write a test that asserts the primary row count equals live_count + tombstone_count, which caught a regression where exec_delete was removing the row from the primary instead of marking it.
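The filter and the row-count invariant described above can be sketched as two helpers. The names live_keys and tombstone_count are hypothetical, and the Row fields simply mirror the lab's description.

```rust
use std::collections::BTreeMap;

#[allow(dead_code)]
struct Row {
    v: i64,
    tag: String,
    created_at: u64,
    deleted_at: u64, // 0 means live, otherwise the txid of the delete
}

/// Keys of all visible rows, in ascending key order.
fn live_keys(primary: &BTreeMap<i64, Row>) -> Vec<i64> {
    primary
        .iter()
        .filter(|(_, row)| row.deleted_at == 0)
        .map(|(k, _)| *k)
        .collect()
}

/// Number of rows that are still stored but marked deleted.
fn tombstone_count(primary: &BTreeMap<i64, Row>) -> usize {
    primary.values().filter(|row| row.deleted_at != 0).count()
}
```

A regression test in the spirit of the one described would assert live_keys(&primary).len() + tombstone_count(&primary) == primary.len() after a delete, and additionally that the deleted key is still present in primary.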

The shape of `kind` distribution

Across 2000 ops in scenario B, the empirical distribution of kind matches the design 3:2:1:1:1 ratio within ~3% — confirming that splitmix64's top 3 bits are sufficiently uniform that we do not need a rejection sampler.
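One way to see why no rejection sampler is needed: three bits give eight values, and 3 + 2 + 1 + 1 + 1 = 8, so every value maps to exactly one kind. The sketch below assumes the five kinds are the five operations of the lab and assigns them in that order; the Kind names and the exact value-to-kind mapping are hypothetical, since only the ratio is stated here.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Insert,      // 3 of 8 values
    Update,      // 2 of 8 values
    Delete,      // 1 of 8 values
    SelectByK,   // 1 of 8 values
    SelectByTag, // 1 of 8 values
}

/// Decode the operation kind from the top 3 bits of a 64-bit random word.
fn decode_kind(r: u64) -> Kind {
    match r >> 61 {
        0..=2 => Kind::Insert,
        3..=4 => Kind::Update,
        5 => Kind::Delete,
        6 => Kind::SelectByK,
        _ => Kind::SelectByTag, // only 7 remains
    }
}
```

Because the mapping covers all eight values with no leftovers, the kind distribution is exactly as uniform as the three bits themselves.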

Non-determinism we did not observe

No flakes across 50 runs of scenario B.
No drift between debug and release builds for any implementation.
No drift between macOS arm64 and Linux x86_64 (sanity-checked once in a throwaway container).

All three of those properties are load-bearing for the cross-language test to be useful: if any of them fail, the script becomes a flaky test and people learn to ignore it.

verification

The verification ladder

Unit tests inside each language. 13 tests per implementation, covering insert/update/delete semantics, the no-op rule on missing keys, secondary-index maintenance across tag changes, the tombstone-then-reinsert path, the wire format byte layout, and the two frozen scenarios.
scripts/verify.sh runs all three suites end to end.
scripts/cross_test.sh builds all three CLI binaries and asserts byte-identical SHA-256 across Rust/Go/C++ for both scenarios and equality with the frozen goldens.

What each layer protects against

Layer	Catches
Unit tests	Wrong semantics within one language: e.g. UPDATE bumping txid when the row was missing.
Frozen golden in unit tests	Drift in one language only: e.g. someone "fixes" the splitmix64 constants.
Cross-language script	Cross-language drift: e.g. Go iterating a map without sorting.
Both goldens	Drift that happens to leave one scenario unchanged. Hitting two seeds at very different op counts is a cheap insurance policy.

Test vectors we pinned

In every language:

SHA256("") = e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
SHA256("abc") = ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
splitmix64(0) = 0x8b57dafca0cee644

If any of those fail, every higher-level test is meaningless, so they run first.

How to debug a cross-language mismatch

If cross_test.sh reports a mismatch:

Re-run with a small --ops (say 10) and raise it until the divergence first appears. A SHA-256 comparison only tells you equal or not equal, so you need to dump the actual bytes.
Add a temporary print of the snapshot's hex before the SHA, in both the suspect language and the reference. Run xxd or od -An -tx1 on the two outputs and diff them.
The first byte that differs almost always points at a section boundary. Decode the next_txid and primary row count by hand.
The two most common causes by a wide margin are (a) unsorted map iteration and (b) a missing ULL on a C++ constant. Check those first.
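If you would rather locate the divergence in code than by eye, a small helper along these lines finds the first differing offset between two snapshot dumps. The name first_diff is hypothetical; it is a debugging aid, not part of the lab's API.

```rust
/// Offset of the first byte at which `a` and `b` differ.
/// If one buffer is a strict prefix of the other, the shorter length is returned.
/// Returns None when the buffers are identical.
fn first_diff(a: &[u8], b: &[u8]) -> Option<usize> {
    let common = a.len().min(b.len());
    for i in 0..common {
        if a[i] != b[i] {
            return Some(i);
        }
    }
    if a.len() != b.len() {
        Some(common)
    } else {
        None
    }
}
```

Map the returned offset back onto the section layout to see which field diverged first.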

Coverage gaps we accept

We do not run a property-based test (no proptest in Rust, no testing/quick in Go). The two seeded scenarios are dense enough that we have not yet seen a real bug that they missed and a property-based test would have caught, and adding one would make the test loop slower and more flaky.

broader ideas

What a "real" SQLite slice would add

If the goal is fidelity rather than pedagogy, the natural next steps, roughly in order of payoff:

A pager backed by pwrite. This is db-11. Once you have a pager the snapshot becomes the file, not an ad-hoc byte stream.
Page-level checksums. Even just XXH3 per page; turns silent corruption into noisy corruption.
A WAL. Append-only journal of operations, replay on open. db-03 does the WAL; the fusion is in db-23.
Schema and DDL. CREATE TABLE, multiple tables, column types. The single-table assumption hides almost all the catalog complexity.
A query planner. Even just a cost-based decision between SELECT_BY_K and SELECT_BY_TAG would be educational. With one table and two indexes the planner is trivial; with joins it explodes.

What this engine could become with concurrency

The MVCC bookkeeping is already there — created_at and deleted_at. What is missing for real read-mostly concurrency:

A reader-visible snapshot timestamp, so SELECT reads consistently as of "the latest committed txid I saw".
Write-set tracking and a commit barrier, so two writers cannot both bump txid without serialising.
Garbage collection of tombstoned rows once no live reader could observe them. The current code holds tombstones forever, which is fine for a benchmark and disastrous for a real system.

The interesting thing is that the snapshot wire format would still work — you would just be dumping a consistent point rather than the literal in-memory state.
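As a rough illustration of what a reader-visible snapshot timestamp would mean for the existing fields, here is a sketch of a visibility predicate. The rule (visible if created at or before the snapshot and not deleted at or before it) and the name visible_at are assumptions, not lab code. It covers only the tombstone half of the problem: the current UPSERT overwrites superseded versions, so older values would not be recoverable.

```rust
#[allow(dead_code)]
struct Row {
    v: i64,
    tag: String,
    created_at: u64,
    deleted_at: u64, // 0 means never deleted
}

/// Would a reader whose snapshot is `snapshot_txid` see this row?
fn visible_at(row: &Row, snapshot_txid: u64) -> bool {
    let created = row.created_at <= snapshot_txid;
    let not_yet_deleted = row.deleted_at == 0 || row.deleted_at > snapshot_txid;
    created && not_yet_deleted
}
```

Garbage collection then has a natural condition: a tombstone can be dropped once every live reader's snapshot_txid is at or past its deleted_at.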

What this engine could never be

Without on-disk persistence, this is not a database; it is a test fixture. Adding the pager moves it to "embedded KV with SQL", which is roughly what SQLite is.

It will never be a server. Network protocols, connection management, client-side query plans, authentication — none of that is in scope for any lab in this series, by design.

Useful tangent: cross-language byte identity as a discipline

This lab is a microcosm of a discipline that pays off elsewhere:

Protobuf does not guarantee a canonical encoding, so systems that hash or sign protobuf messages have to define their own canonical byte form first.
Git's object hashing depends on a canonical layout per object type.
Bitcoin transactions are SHA-256-d in a canonical byte form.

Whenever you find yourself asking "is this implementation correct", producing a canonical byte stream from each implementation and hashing it is one of the cheapest mechanical proofs you can buy.

Goal

Build the in-memory primary container. By the end of this step you should have:

a Row { v, tag, created_at, deleted_at } value type,
a Conn with next_txid starting at 1 and an ordered primary: i64 -> Row map,
exec_insert (UPSERT) bumping next_txid every call,
select_by_k returning live rows only.

Even though we are not implementing a real on-disk pager here, the discipline of treating the primary map as the single source of truth for both visible and tombstoned state mirrors what a pager gives you: a flat, ordered store you can walk in key order.

If you wanted, you could later swap the in-memory std::map for a B-tree built on top of db-11's pager and not have to change anything else in this lab.

Tasks

Define Row and Conn in your chosen language.
Implement exec_insert with UPSERT semantics. Make sure that inserting at the same key twice replaces the row and uses the new txid in created_at.
Implement select_by_k. It must filter out tombstoned rows.
Write a unit test that inserts two keys, selects them back, and asserts next_txid == 3.

Pitfalls

If you use a hash map (Go map, C++ unordered_map), the wire test in step 03 will fail because iteration order will not be deterministic. Use an ordered map (BTreeMap, sorted-keys iteration, std::map).
Use i64 for k. i32 will silently truncate when the workload in step 03 mods a u64 by keys and casts.
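A minimal Rust sketch of the container this step asks for, under the assumption that v is an i64 and exec_insert takes the tag as a string slice (the step only fixes the field names, the starting txid, and the UPSERT rule).

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
struct Row {
    v: i64,
    tag: String,
    created_at: u64,
    deleted_at: u64, // 0 means live
}

struct Conn {
    next_txid: u64,
    primary: BTreeMap<i64, Row>, // ordered, so iteration is deterministic
}

impl Conn {
    fn new() -> Self {
        Conn { next_txid: 1, primary: BTreeMap::new() }
    }

    /// UPSERT: always consumes a txid and replaces any existing row at `k`.
    fn exec_insert(&mut self, k: i64, v: i64, tag: &str) {
        let txid = self.next_txid;
        self.next_txid += 1;
        let row = Row { v, tag: tag.to_string(), created_at: txid, deleted_at: 0 };
        self.primary.insert(k, row);
    }

    /// Returns the row only if it exists and is not tombstoned.
    fn select_by_k(&self, k: i64) -> Option<&Row> {
        self.primary.get(&k).filter(|row| row.deleted_at == 0)
    }
}
```

The unit test from the task list then reads: create a Conn, insert two keys, select both back, and assert next_txid == 3.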

step 02 — SQL surface and MVCC

Goal

Add exec_update, exec_delete, select_by_tag, and the secondary index. By the end of this step:

exec_update must be a no-op (and not bump next_txid) if the row is missing or tombstoned. If present, it keeps the original created_at and only mutates v and tag.
exec_delete must be a no-op if the row is missing or tombstoned. If present, it sets deleted_at = next_txid and removes the row from the secondary index. The row stays in the primary.
The secondary index tag -> sorted Vec<i64> is maintained on every mutating op. Only live rows are present.
select_by_tag(tag) returns the secondary list, or empty.
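The no-op rules for exec_update and exec_delete can be sketched as follows. Secondary-index maintenance is left out to keep the example short, the bool return value is an illustrative addition, and the sketch assumes both operations consume a txid when they do apply.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
struct Row {
    v: i64,
    tag: String,
    created_at: u64,
    deleted_at: u64, // 0 means live
}

struct Conn {
    next_txid: u64,
    primary: BTreeMap<i64, Row>,
}

impl Conn {
    /// Mutates v and tag of a live row; a miss leaves next_txid untouched.
    fn exec_update(&mut self, k: i64, v: i64, tag: &str) -> bool {
        let Some(row) = self.primary.get_mut(&k) else { return false };
        if row.deleted_at != 0 {
            return false; // tombstoned rows are treated as missing
        }
        row.v = v;
        row.tag = tag.to_string(); // created_at is deliberately kept
        self.next_txid += 1;
        true
    }

    /// Marks a live row as deleted; the row stays in the primary.
    fn exec_delete(&mut self, k: i64) -> bool {
        let Some(row) = self.primary.get_mut(&k) else { return false };
        if row.deleted_at != 0 {
            return false;
        }
        row.deleted_at = self.next_txid;
        self.next_txid += 1;
        true
    }
}
```

Calling either method on a key that does not exist returns false and leaves next_txid where it was, which is exactly the property the cross-language hash depends on.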

Tasks

Wire exec_update and exec_delete with the no-op-on-missing rule. Test it by calling each on a key that does not exist and asserting next_txid did not move.
Implement secondary insertion as sorted insert (binary search + shift, or BTreeMap::entry().or_default() + sorted insert).
Implement secondary removal as sorted lookup + erase. If the list becomes empty, drop the tag entirely (otherwise the snapshot will carry empty entries and diverge from the spec).
Add a test that inserts three rows with the same tag in scrambled key order, then asserts select_by_tag returns them in ascending order.
Add a test for the resurrection path: insert, delete, insert again on the same key. The new row must have a fresh created_at and deleted_at == 0.
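The two secondary-index tasks can be sketched with a BTreeMap of sorted vectors. The Secondary alias and the function names are hypothetical; the behaviour (sorted insert, sorted erase, drop the tag when its list empties) follows the task list.

```rust
use std::collections::BTreeMap;

type Secondary = BTreeMap<String, Vec<i64>>;

/// Insert `k` under `tag`, keeping the key list sorted and free of duplicates.
fn secondary_insert(sec: &mut Secondary, tag: &str, k: i64) {
    let keys = sec.entry(tag.to_string()).or_default();
    if let Err(pos) = keys.binary_search(&k) {
        keys.insert(pos, k);
    }
}

/// Remove `k` from `tag`; drop the tag entirely once its list is empty.
fn secondary_remove(sec: &mut Secondary, tag: &str, k: i64) {
    let mut now_empty = false;
    if let Some(keys) = sec.get_mut(tag) {
        if let Ok(pos) = keys.binary_search(&k) {
            keys.remove(pos);
        }
        now_empty = keys.is_empty();
    }
    if now_empty {
        sec.remove(tag);
    }
}
```

Inserting keys 9, 3, 5 under one tag leaves the list as [3, 5, 9]; removing all three leaves no entry for that tag, so the snapshot never carries an empty list.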

Pitfalls

The most common bug is bumping next_txid in exec_update even on a no-op. The unit tests in one language will pass; the cross-language hash will diverge after the first missing-key update.
Forgetting to drop an empty tag from secondary after the last delete will add a zero-length entry to the snapshot dump and break cross-language byte equality.
In C++, std::map::operator[] default-constructs missing entries silently — use find for reads and [] only when you intend the insert.

step 03 — cross-language snapshot

Goal

Produce the canonical snapshot byte stream defined in ../CONCEPTS.md, run the deterministic workload in each language, and assert byte-identical SHA-256 across Rust, Go, and C++.

By the end of this step:

dump_snapshot exists in every language and produces bytes that match the spec section-for-section.
A run_workload(seed, ops, keys, scenario) function exists in every language and is bit-exact.
The CLI prints the hex SHA-256 with no trailing newline.
scripts/verify.sh ends with === OK ===.
scripts/cross_test.sh ends with === ALL OK === and reports both golden hashes for scenarios A and B.

Tasks

Implement dump_snapshot. Build it incrementally: write the magic + header first, get a single-row dump matching by hand, then add the secondary section.
Implement splitmix64 and a stateful SplitMix64::next(). Pin a test for splitmix64(0) == 0x8b57dafca0cee644 to guard against constant typos.
Implement run_workload per the rules in CONCEPTS.md. Pay special attention to: drawing all three rng words even for read ops; the kind decoding (r1 >> 61) & 0x7, which takes the top 3 bits; the modulo casts to i64.
Implement sha256_hex. In Rust use the sha2 crate. In Go use crypto/sha256 + encoding/hex. In C++ inline the reference implementation (FIPS 180-4) — keep it in the same translation unit as the engine to avoid a third-party dependency. Pin SHA256("") and SHA256("abc") in tests.
Wire up the CLI: sqlitectl workload --seed N --ops N --keys N --scenario S. Print the hex with print! / fmt.Print / std::cout — no newline.
Run scripts/verify.sh then scripts/cross_test.sh. Iterate until both end with their success markers.

Debugging a divergence

If cross_test.sh shows different hashes between languages, follow the ladder in ../docs/verification.md: shrink the op count, dump the raw snapshot bytes with xxd, diff, and look for the first differing byte. It almost always points at a section boundary that exposes either map-iteration order or a wrong-width cast.

Acceptance

All three unit suites pass under release optimisation.
Both === OK === and === ALL OK === markers appear.
Scenario A hash: e8ccacd39d8535c1ed101f0bc8b7a0799f56468a384da9284d4768cd8b3a3aab.
Scenario B hash: dd1d6bb7fec1ffc9f71f01e75a58166b04517a669495af2aa2da432d4722db69.

db-16 — Distributed-Fundamentals

This lab builds the vocabulary the rest of the distributed track (db-17 Raft, db-18 Paxos, db-19 ZAB, db-20 distributed-kv) will speak in: logical clocks, vector clocks, the happens-before relation, and a deterministic discrete-event simulator that produces a byte-identical event log across three independent implementations (Rust, Go, C++).

If you cannot write a simulator whose output is bit-stable across runs and across languages, you cannot run reproducible distributed-systems experiments. Every other lab in the track will reuse the discipline established here.

What is it?

A distributed system is a collection of nodes that exchange messages over an asynchronous, lossy network. Three primitives let us reason about such systems without having a wall-clock everyone agrees on:

Lamport clock: a single integer per node that is incremented on every local event, stamped onto each outgoing message, and bumped to max(self, incoming) + 1 on receive. Lamport (1978) showed that this discipline yields timestamps consistent with causality: if event a happens-before event b, then ts(a) < ts(b). The reverse is not true. Breaking timestamp ties by node id extends this to a total order.
Vector clock: one counter per node, packaged into a map. A local event increments the owner's counter; a receive takes the pointwise max(self, incoming) and then increments the owner's counter. The resulting partial order is the happens-before relation: two events are concurrent iff neither clock dominates the other.
Deterministic discrete-event simulator: a single-threaded loop that drives sim time forward in integer ticks, delivering messages whose delivery_time == t before letting nodes act. With a seeded PRNG and canonical message ordering, the same (seed, nodes, rounds) triple must always produce the same event log, in any language.

Why does it matter?

Raft (db-17), Paxos (db-18), ZAB (db-19) all rely on causality: a leader can only commit an entry after it has been replicated to a quorum (a majority of the cluster, counting the leader itself). Vector clocks give us the language to prove that a particular log entry could not have been committed before a prerequisite was acknowledged.
Reproducibility is the difference between "I think my consensus algorithm is correct" and "I have an event log I can re-run on someone else's machine and get the same answer." When db-17 develops a leader-election bug under network partition, the first thing you reach for is a deterministic replay of the failure.
Three independent implementations force clarity. Any ambiguity in the spec ("when do you read the clock vs. increment it?") will show up as a byte diff in scripts/cross_test.sh. Pinning the wire format and the scheduling rule is the lab.

How does it work?

Lamport rule

local event :  self += 1
send        :  self += 1 ; stamp message with self
recv(m)     :  self = max(self, m.ts) + 1
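The three rules above translate almost line for line into Rust. This is a sketch: the struct name Lamport matches the primitive named in this lab, but the method names and the public ts field are assumptions.

```rust
#[derive(Debug, Default)]
struct Lamport {
    ts: u64,
}

impl Lamport {
    /// Local event: self += 1.
    fn tick(&mut self) -> u64 {
        self.ts += 1;
        self.ts
    }

    /// Send: self += 1, and the returned value is stamped on the message.
    fn send(&mut self) -> u64 {
        self.ts += 1;
        self.ts
    }

    /// Receive: self = max(self, incoming) + 1.
    fn recv(&mut self, incoming: u64) -> u64 {
        self.ts = self.ts.max(incoming) + 1;
        self.ts
    }
}
```

A fresh node that sends stamps its message with 1; a fresh node receiving that message moves to 2, while a node already at 5 moves to 6.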

Vector-clock rule

local event(i)    :  vc[i] += 1
send(i)           :  vc[i] += 1 ; stamp message with snapshot of vc
recv(i, m)        :  for k in m.vc : vc[k] = max(vc[k], m.vc[k])
                     vc[i] += 1
partial order     :  vc_a < vc_b   iff (∀k) vc_a[k] ≤ vc_b[k]  AND  vc_a ≠ vc_b
                     vc_a || vc_b  iff neither <  nor  > nor =
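A sketch of the same rules in Rust, with missing entries treated as zero. VcOrd is named elsewhere in this lab, but its variant names, the counters field, and the method names here are assumptions.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Default)]
struct VectorClock {
    counters: BTreeMap<u32, u64>, // node id -> counter, iterated in ascending id order
}

#[derive(Debug, PartialEq)]
enum VcOrd {
    Less,
    Greater,
    Equal,
    Concurrent,
}

impl VectorClock {
    /// Local event or send by `owner`: vc[owner] += 1.
    fn tick(&mut self, owner: u32) {
        *self.counters.entry(owner).or_insert(0) += 1;
    }

    /// Receive at `owner`: pointwise max with the incoming clock, then bump own counter.
    fn recv(&mut self, owner: u32, incoming: &VectorClock) {
        for (&node, &count) in &incoming.counters {
            let slot = self.counters.entry(node).or_insert(0);
            *slot = (*slot).max(count);
        }
        self.tick(owner);
    }

    /// Partial-order comparison; an absent entry counts as 0.
    fn compare(&self, other: &VectorClock) -> VcOrd {
        let (mut le, mut ge) = (true, true);
        for node in self.counters.keys().chain(other.counters.keys()) {
            let a = self.counters.get(node).copied().unwrap_or(0);
            let b = other.counters.get(node).copied().unwrap_or(0);
            if a > b {
                le = false;
            }
            if a < b {
                ge = false;
            }
        }
        match (le, ge) {
            (true, true) => VcOrd::Equal,
            (true, false) => VcOrd::Less,
            (false, true) => VcOrd::Greater,
            (false, false) => VcOrd::Concurrent,
        }
    }
}
```

After node 0 sends and node 1 receives that message, the sender's clock compares as Less than the receiver's; a third node that has only ticked locally is Concurrent with both.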

Simulator loop

for t in 0 .. rounds + MAX_DELAY:
    # 1. deliver: strict (delivery_time, sender_id, seq) order
    while heap is not empty and heap.top().delivery_time == t:
        msg = heap.pop()
        node[msg.dest].recv(msg)
        emit Recv

    # 2. send: only during the active window
    if t < rounds:
        for s in 0 .. nodes:
            r       = splitmix64(seed ^ (t<<32) ^ (s+1))
            pre     = (r & 0xFFFF) % (nodes - 1)
            dest    = pre + 1 if pre >= s else pre    # skip self
            delay   = 1 + ((r>>16) & 0xFFFF) % 3      # 1..3
            payload = (r>>32) & 0xFF
            node[s].send_to(dest, delay, payload)
            emit Send

The two phases (deliver-then-send) per tick, the strict heap ordering, and the splitmix64 PRNG together guarantee determinism.
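The per-sender decode in the send phase can be sketched as a pure function of the random word, which makes it easy to test without the PRNG. The function name decode_send and the chosen integer widths are assumptions; the bit fields and the skip-self rule follow the loop above.

```rust
/// Split one 64-bit random word into (dest, delay, payload) for sender `s`.
/// `nodes` must be at least 2 so that a non-self destination exists.
fn decode_send(r: u64, s: u32, nodes: u32) -> (u32, u64, u8) {
    // Pick among the other nodes - 1 nodes, then step over the sender's own id.
    let pre = ((r & 0xFFFF) % u64::from(nodes - 1)) as u32;
    let dest = if pre >= s { pre + 1 } else { pre };
    // Delay is always in 1..=3 ticks.
    let delay = 1 + ((r >> 16) & 0xFFFF) % 3;
    // One payload byte.
    let payload = ((r >> 32) & 0xFF) as u8;
    (dest, delay, payload)
}
```

For example, with r = 0x0000_00AB_0002_0001 and three nodes, sender 0 gets (2, 3, 0xAB) and sender 2 gets (1, 3, 0xAB): the same word, but the skip-self step shifts the destination depending on who is sending.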

Canonical wire format

file := magic[4="DSE6"] u32_le(event_count) event*

event :=
    u8  kind                  # 1 = Send, 2 = Recv
    u64_le sim_time
    u32_le node               # sender for Send, receiver for Recv
    u32_le peer               # dest for Send, source for Recv
    u64_le lamport            # value AFTER the local step
    u32_le vc_len
    [u32_le node, u64_le counter] * vc_len   # sorted ASC by node
    u32_le payload_len
    u8 payload[payload_len]

All multi-byte numbers are little-endian. Vector-clock entries must be serialized in ascending order by node-id; this is the single most common source of byte-diff bugs.
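A sketch of an encoder for the grammar above. The Event struct and the function names are assumptions; the field order and widths follow the spec. Storing the vector clock in a BTreeMap means the ascending node-id order comes from iteration rather than from an explicit sort.

```rust
use std::collections::BTreeMap;

struct Event {
    kind: u8, // 1 = Send, 2 = Recv
    sim_time: u64,
    node: u32,
    peer: u32,
    lamport: u64,
    vc: BTreeMap<u32, u64>, // iterates in ascending node-id order
    payload: Vec<u8>,
}

/// Append one event to `out` in the canonical little-endian layout.
fn encode_event(out: &mut Vec<u8>, e: &Event) {
    out.push(e.kind);
    out.extend_from_slice(&e.sim_time.to_le_bytes());
    out.extend_from_slice(&e.node.to_le_bytes());
    out.extend_from_slice(&e.peer.to_le_bytes());
    out.extend_from_slice(&e.lamport.to_le_bytes());
    out.extend_from_slice(&(e.vc.len() as u32).to_le_bytes());
    for (node, counter) in &e.vc {
        out.extend_from_slice(&node.to_le_bytes());
        out.extend_from_slice(&counter.to_le_bytes());
    }
    out.extend_from_slice(&(e.payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&e.payload);
}

/// Encode a whole log: magic, event count, then every event.
fn encode_log(events: &[Event]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(b"DSE6");
    out.extend_from_slice(&(events.len() as u32).to_le_bytes());
    for e in events {
        encode_event(&mut out, e);
    }
    out
}
```

A single Send with one vector-clock entry and a one-byte payload encodes to 46 bytes, so a log containing only that event is 54 bytes long.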

Cross-language invariants

Invariant	Why it matters
splitmix64 mix `seed ^ (t<<32) ^ (s+1)`	identical PRNG stream
dest skip-self: `if pre >= s then pre+1`	identical destination choice
heap order `(delivery_time, sender, seq)`	identical delivery order
`seq` is global monotonic	deterministic tie-break across nodes
VC entries sorted by node-id on the wire	byte-identical serialization
all integers little-endian	byte-identical on every host

If any one of these drifts, scripts/cross_test.sh will fail at the sha256 compare, and cmp -l will list every differing byte, starting with the (1-based) offset of the first divergence.

Files

src/rust/ — distfund16 crate + simctl binary.
src/go/ — module github.com/10xdev/dse/db16 + cmd/simctl.
src/cpp/ — db16_lib static library + simctl binary + test_db16.
scripts/verify.sh — runs the unit tests for all three.
scripts/cross_test.sh — proves the three binaries produce byte-identical event logs for two seeded scenarios.

See docs/ for the longer write-up, and steps/ for the staged implementation path.

db-16 — References

Primary sources

Leslie Lamport, Time, Clocks, and the Ordering of Events in a Distributed System, CACM 21(7), 1978. The original paper. Defines happens-before, the logical clock, and (in §4) the construction of a total order consistent with causality. https://lamport.azurewebsites.net/pubs/time-clocks.pdf
Colin Fidge, Timestamps in Message-Passing Systems That Preserve the Partial Ordering, 11th ACSC, 1988. Introduces vector clocks and proves they characterize the happens-before relation exactly.
Friedemann Mattern, Virtual Time and Global States of Distributed Systems, 1989. The companion vector-clock paper; reads more approachably than Fidge.
Sebastiano Vigna, splitmix64 — a small, fast, well-distributed 64-bit mixer used as the seeder for xoroshiro. https://prng.di.unimi.it/splitmix64.c

Determinism and reproducibility

Stefan Savage et al., Eraser: A Dynamic Data Race Detector for Multithreaded Programs, SOSP 1997. Not directly cited here, but the motivation ("if you cannot replay a bug deterministically you cannot debug it") is the entire reason this lab exists.
FoundationDB's simulation testing (Apple/Snowflake) — a production example of deterministic discrete-event simulation at scale. https://apple.github.io/foundationdb/testing.html
Jepsen — Kyle Kingsbury's distributed-systems testing harness. Not deterministic itself (it injects real faults), but the methodology of "generate events, observe a history, check it against a model" is the vocabulary db-16 sets up. https://jepsen.io/

Production engines that use these primitives

Riak / Dynamo — vector clocks for sibling reconciliation.
CRDTs (Shapiro, Preguiça, Baquero, Zawirski, 2011) — vector clocks and version vectors are the substrate for state-based merge functions.
TLA+ — Lamport's specification language; ordering events by Lamport clock is the mental model behind every TLA+ refinement proof.

Cross-lab dependencies

This lab has no upstream dependencies. It is the bedrock for the distributed track.
db-17 Raft consumes the simulator: leader-election scenarios and log-replication invariants will be expressed as scripted event sequences run against a deterministic transport built on top of db-16.
db-18 Paxos, db-19 ZAB, db-20 distributed-kv reuse the same vocabulary (Lamport/VC for causality assertions, deterministic scheduler for fault-injection replay).

db-16 — Analysis

Required invariants

Lamport monotonicity. For any node n, the sequence of Lamport values produced by its successive tick/send/recv calls is strictly increasing.
Lamport consistency with happens-before. If a → b (happens-before), then ts(a) < ts(b). The converse does not hold; that is the cost of compressing a partial order into a single integer.
Vector-clock characterization. With vector clocks the biconditional holds: a → b iff vc(a) < vc(b) componentwise (and vc(a) ≠ vc(b)).
Send-precedes-receive. Every Recv event in the simulator is paired with exactly one Send event from (peer → node) whose sim_time is strictly less than the Recv's sim_time and whose vector clock is strictly less than the Recv's.
Byte determinism. For every (seed, nodes, rounds), the three binaries produce identical bytes on stdout. This is the single property scripts/cross_test.sh checks; if it ever drifts, all downstream labs lose reproducibility.

Design decisions

Two-phase tick (deliver-then-send). Each integer tick first drains all in-flight messages whose delivery_time has arrived, then runs every node's send. Doing deliver first means a single tick can witness a message being received and a response being sent — capturing causal flow without needing finer time resolution.
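Heap ordered by (delivery_time, sender, seq). The first two fields already separate messages from different senders. The third field (seq, a global monotonic counter) gives an unconditional tie-break when one sender has two messages due on the same tick, for example one sent at tick t with delay 2 and one sent at tick t+1 with delay 1.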
splitmix64 seeded per (seed, t, s). A single splitmix64 call produces all three random fields (dest, delay, payload) for one (t, s) decision. This avoids the question "whose RNG state advances first" — there is no shared RNG state at all.
Vector-clock entries sorted on the wire. BTreeMap in Rust, sorted-key iteration in Go, std::map in C++ all produce ascending order naturally. If you ever switch the Rust side to HashMap you will get byte diffs.
Lib + thin CLI. All three implementations expose the same trio of primitives (Lamport, VectorClock, simulate/Simulate) as a library. The CLI is ten lines that call serialize(simulate(...)) and write to stdout. Downstream labs will link the library, not shell out to the CLI.
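The heap-ordering decision can be sketched with the standard library's BinaryHeap, which is a max-heap, so the key is wrapped in Reverse to pop the smallest tuple first. The tuple stands in for a full message and drain_due is a hypothetical helper, not the lab's API.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Heap key: (delivery_time, sender, seq). Tuples compare field by field.
type Key = (u64, u32, u64);

/// Pop every message due at tick `t`, in canonical delivery order.
fn drain_due(heap: &mut BinaryHeap<Reverse<Key>>, t: u64) -> Vec<Key> {
    let mut due = Vec::new();
    // peek().copied() ends the borrow of the heap before we pop from it.
    while let Some(Reverse(top)) = heap.peek().copied() {
        if top.0 != t {
            break;
        }
        heap.pop();
        due.push(top);
    }
    due
}
```

With keys (2, 1, 5), (1, 2, 0), (1, 0, 3) and (1, 0, 1) in the heap, draining tick 1 yields (1, 0, 1), (1, 0, 3), (1, 2, 0): sender 0 before sender 2, and within sender 0 the lower seq first.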

Why three languages

Forces the spec to be unambiguous. A Rust BTreeMap and a C++ std::map both happen to iterate in key order; the moment you reach for Go's map you discover the language does not and you must sort explicitly. That kind of discovery only happens with multiple implementations.
Pins endianness, integer overflow semantics (wrapping), and signed-vs-unsigned modulo. splitmix64 in particular depends on unsigned wrapping multiplication; expressing it identically in three languages is a forcing function.
Future-proofs the track. db-17 onwards will pick one host language per experiment; having a reference implementation in three independent languages means a port bug in db-17's Raft simulator can be cross-checked against the db-16 baseline.

Tradeoffs worth flagging

Sim time is integer ticks, not floating-point seconds. This trades realism for determinism. Real networks have continuous-time jitter; capturing that would require an event-priority structure keyed by a rational/decimal time, which is not worth the complexity for a study lab.
All sends are unicast and always succeed. We do not model drops, reorderings beyond delay-based interleaving, or partitions. db-17 will add a partition primitive on top of this simulator; doing it here would mean adding --drop-rate to the CLI and changing the wire format, which would lock in a poor abstraction.
Each node sends exactly one message per tick during the active window. That is a fixed-load workload. Variable-load (silent nodes, bursty senders) would be a strict extension; it is intentionally omitted to keep the spec small enough to verify by hand.

db-16 — Execution

One-shot: prove the lab works

cd db-16-distributed-fundamentals
./scripts/verify.sh        # all unit tests in Rust, Go, C++
./scripts/cross_test.sh    # byte-identical event logs across all three

A green run of cross_test.sh ends with the literal line:

=== ALL OK ===

Per-language workflows

Rust

cd src/rust
cargo test                # 7 tests
cargo build --release     # produces target/release/simctl
./target/release/simctl --seed 42 --nodes 3 --rounds 20 > /tmp/log_rust.bin

Go

cd src/go
go test ./...             # 7 tests
go build -o /tmp/simctl_go ./cmd/simctl
/tmp/simctl_go --seed 42 --nodes 3 --rounds 20 > /tmp/log_go.bin

C++

cd src/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
ctest --test-dir build   # test_db16 → "db-16 C++ tests: 7 passed"
./build/simctl --seed 42 --nodes 3 --rounds 20 > /tmp/log_cpp.bin

CLI

All three binaries accept the same flags:

flag	default	meaning
`--seed N`	0	splitmix64 seed
`--nodes K`	3	number of nodes; must be ≥ 2
`--rounds R`	20	number of send-rounds (sim time runs for R + MAX_DELAY ticks)

The output is the binary wire format described in CONCEPTS.md. Pipe to a file; do not display on a terminal.

Canonical scenarios

scripts/cross_test.sh runs two scenarios; their sha256s are checked into the lab's verification path:

label	args	sha256
A	`--seed 42 --nodes 3 --rounds 20`	`0d7e753cdc891e3a481977da372a4d97a6a0e0ab00b74f5a074dbc25791dc797`
B	`--seed 7 --nodes 5 --rounds 50`	`321221187709684afd59c55202f8d373dad33c8026e933b36740aeed23c8c2d4`

If you change any of: PRNG, scheduler order, wire format, or VC entry ordering — these hashes will change and you must update both the script and this table in the same commit. That synchronization step is the forcing function that keeps the spec honest.

Sanity checks

# magic bytes
./target/release/simctl --seed 42 --nodes 3 --rounds 20 | xxd -l 8
# expect:  00000000: 4453 4536 7800 0000  DSE6x...
# 0x78 = 120 = events: 60 Sends + 60 Recvs for nodes=3 rounds=20

# event count
./target/release/simctl --seed 42 --nodes 3 --rounds 20 | \
  python3 -c 'import sys,struct; d=sys.stdin.buffer.read(); print(struct.unpack("<I", d[4:8])[0])'
# → 120

db-16 — Observation

What does the simulator's output actually look like, and how do you read it by hand?

offset 0x00 :  44 53 45 36                 "DSE6"   (magic)
offset 0x04 :  78 00 00 00                 120      (event_count, u32 LE)

For --seed 42 --nodes 3 --rounds 20 the event count is 3 nodes × 20 rounds × 2 (send + recv) = 120.

A single Send event

Every Send is the start of a causal arc; every Recv is its endpoint. The first event in scenario A is a Send from node 0 at sim_time 0:

01                       kind = 1 = Send
00 00 00 00 00 00 00 00  sim_time   = 0
00 00 00 00              node       = 0    (sender)
?? 00 00 00              peer       = ?    (destination, computed from PRNG)
01 00 00 00 00 00 00 00  lamport    = 1    (Send rule: self += 1, then stamp)
01 00 00 00              vc_len     = 1
00 00 00 00 01 00 00 00 00 00 00 00   (node=0, counter=1)
01 00 00 00              payload_len = 1
??                       payload byte

Note the vector clock for a node that has only sent has a single entry (its own counter). Receivers' vector clocks grow as they merge incoming clocks.

A single Recv event

Recvs look identical except kind = 2 and peer is the source node:

02                       kind = 2 = Recv
?? ?? ?? ?? ?? ?? ?? ??  sim_time   = original send time + delay
01 00 00 00              node       = 1   (receiver)
00 00 00 00              peer       = 0   (sender of paired Send)
?? ?? ?? ?? ?? ?? ?? ??  lamport    = max(self_before, incoming) + 1
02 00 00 00              vc_len     = 2
00 00 00 00 01 00 00 00 00 00 00 00   merged entry for node 0
01 00 00 00 ?? 00 00 00 00 00 00 00   own counter, incremented
01 00 00 00              payload_len = 1
??                       payload byte (copied from send)

The number of VC entries grows as a node hears from new peers; in a 3-node, 20-round run each receiver will eventually have all 3 entries.

Hex walkthrough

./simctl --seed 42 --nodes 3 --rounds 20 | xxd | head

Read column-by-column:

00000000: 44 53 45 36 78 00 00 00                         header: magic "DSE6", event_count=120
00000008: 01                                              first Send: kind=1
00000009: 00 00 00 00 00 00 00 00                         sim_time=0
00000011: 00 00 00 00                                     node=0
00000015: ?? 00 00 00                                     peer
00000019: 01 00 00 00 00 00 00 00                         lamport=1
00000021: 01 00 00 00                                     vc_len=1
00000025: 00 00 00 00 01 00 00 00 00 00 00 00             vc entry (0 → 1)
00000031: 01 00 00 00                                     payload_len=1
00000035: ??                                              payload byte
00000036: 01 ...                                          next event (node 1's Send at t=0)

The whole file for scenario A is 8156 bytes; scenario B is 45592 bytes.
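The file sizes can be cross-checked against the wire format with a little arithmetic. The sketch below assumes every payload is exactly one byte, which is what the one-byte payload field in the simulator loop suggests; the constant and function names are hypothetical.

```rust
const HEADER_LEN: u64 = 8; // magic + event_count
const FIXED_EVENT_LEN: u64 = 33; // kind + sim_time + node + peer + lamport + vc_len + payload_len
const VC_ENTRY_LEN: u64 = 12; // u32 node + u64 counter

/// Encoded size of one event with `vc_len` clock entries and `payload_len` payload bytes.
fn event_len(vc_len: u64, payload_len: u64) -> u64 {
    FIXED_EVENT_LEN + VC_ENTRY_LEN * vc_len + payload_len
}

/// Total vector-clock entries in a log of `events` events, assuming one-byte payloads.
/// Returns None if the size is not consistent with the layout.
fn total_vc_entries(file_len: u64, events: u64) -> Option<u64> {
    let fixed = HEADER_LEN + events * event_len(0, 1);
    let body = file_len.checked_sub(fixed)?;
    if body % VC_ENTRY_LEN == 0 {
        Some(body / VC_ENTRY_LEN)
    } else {
        None
    }
}
```

Under that assumption, scenario A's 8156 bytes and 120 events imply 339 vector-clock entries (about 2.8 per event), and scenario B's 45592 bytes and 500 events imply 2382 (about 4.8 per event). Both are consistent with most clocks having grown to nearly the full node count.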

What to learn from looking at it

Lamport values are strictly increasing within a node but may regress between nodes. That is healthy: nodes 0 and 1 can be ahead of node 2 if 2 hasn't sent or received yet.
The vector-clock entry for node i in node i's own events is strictly monotonic.
For any Send/Recv pair, the Recv's VC must dominate the Send's VC (> in VcOrd). This is exactly what check_causality asserts.
If you sort all events by sim_time you get a globally consistent "tape" — but events at the same sim_time are concurrent and have no inherent ordering between nodes. Deliveries are scheduled before sends within a tick by simulator policy, not by physics.

Cross-language reading

scripts/cross_test.sh prints the hex of the first 8 bytes (4453 4536 7800 0000 for scenario A). If the three implementations agree on those 8 bytes but disagree on the rest, the suspect is almost always either (a) VC-entry order on the wire, or (b) heap tie-break by sender id.

db-16 — Verification

How to reproduce the green status on a clean machine.

Prerequisites

macOS or Linux with Apple Clang / clang ≥ 14 / gcc ≥ 11.
cmake ≥ 3.20.
Rust toolchain ≥ 1.74.
Go ≥ 1.22.
shasum, xxd, awk (default on macOS; coreutils on Linux).

One command

cd db-16-distributed-fundamentals
scripts/verify.sh        # builds + unit tests in all three langs
scripts/cross_test.sh    # cross-language sha256 match

Both should print === OK === / === ALL OK === and exit 0.

Per-language drill-down

Rust

cd db-16-distributed-fundamentals/src/rust
cargo test --quiet
cargo build --release

Expected: 7 passed; 0 failed. The simctl binary lands in target/release/simctl.

Go

cd db-16-distributed-fundamentals/src/go
go test ./...
go build ./cmd/simctl

Expected: ok github.com/10xdev/dse/db16 <duration>.

C++

cd db-16-distributed-fundamentals/src/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
ctest --test-dir build --output-on-failure

Expected: 100% tests passed, 0 tests failed out of 1 and test_db16 prints "db-16 C++ tests: 7 passed".

What "green" means

A green run guarantees:

All 21 unit tests pass (7 each in Rust, Go, C++) covering Lamport monotonicity, vector-clock partial order including the Concurrent case, simulator determinism on a fixed seed, and causality of the generated event log.

The cross-language test produces byte-identical event logs for both canonical scenarios:

scenario	sha256	size
A `--seed 42 --nodes 3 --rounds 20`	`0d7e753cdc891e3a481977da372a4d97a6a0e0ab00b74f5a074dbc25791dc797`	8 156 B
B `--seed 7 --nodes 5 --rounds 50`	`321221187709684afd59c55202f8d373dad33c8026e933b36740aeed23c8c2d4`	45 592 B

Matching sha256s prove that all three implementations agree on the PRNG, the scheduling rule, the Lamport / vector-clock update rules, the VC entry ordering on the wire, and the integer endianness.

The spot-check in cross_test.sh confirms the magic header 44 53 45 36 and the expected u32 LE event count, guarding against the regression where all three implementations agree on producing empty output.
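The header spot-check can also be done in code rather than with xxd. A minimal sketch, assuming only what this page states about the header (magic "DSE6", then a u32 LE event count); the function names are hypothetical and not part of the lab's library:

```rust
/// Parse the 8-byte header of a db-16 event log: magic "DSE6" + u32 LE count.
/// Returns the event count, or None if the buffer is too short or the magic is wrong.
pub fn parse_header(bytes: &[u8]) -> Option<u32> {
    if bytes.len() < 8 || &bytes[0..4] != b"DSE6" {
        return None;
    }
    Some(u32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]))
}

/// Expected count from the lab's formula: every send yields exactly one recv.
pub fn expected_event_count(nodes: u32, rounds: u32) -> u32 {
    2 * nodes * rounds
}
```

For scenario A (3 nodes, 20 rounds) both functions should agree on 120. An empty or truncated output returns None, which is the "all three agree on nothing" regression the spot-check guards against.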

When verification fails

Outputs differ within the first 8 bytes: magic or count drift. Almost always a bug in the count formula (2 × nodes × rounds).
Mismatch past byte 8 but matching on a smaller --rounds — the PRNG or the scheduler diverges as soon as a recv-in-flight overlaps with a send. Inspect splitmix64 and the heap tie-break.
Causality test fails in one language only — that language's recv does not bump its own counter, or bumps before the merge. Read the Vector-clock rule in CONCEPTS again.
One language passes locally but the cross-test diverges — most often: VC entries serialized in insertion order rather than sorted by node-id. Switch to BTreeMap / std::map / explicit sort.Slice.
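The last failure mode is easy to reproduce in isolation. The sketch below assumes a hypothetical entry layout (u32 LE entry count, then a u32 LE node id and u64 LE counter per entry); the lab's real layout lives in CONCEPTS.md and may differ. What it illustrates is only the ordering rule: sort by node id before writing bytes.

```rust
use std::collections::{BTreeMap, HashMap};

/// Serialize a vector clock with entries in ascending node-id order.
/// ASSUMED layout, for illustration only: u32 LE count, then
/// (u32 LE node_id, u64 LE counter) per entry.
pub fn serialize_vc(vc: &BTreeMap<u32, u64>) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + vc.len() * 12);
    out.extend_from_slice(&(vc.len() as u32).to_le_bytes());
    // BTreeMap iterates in ascending key order, so no explicit sort is needed.
    for (node_id, counter) in vc {
        out.extend_from_slice(&node_id.to_le_bytes());
        out.extend_from_slice(&counter.to_le_bytes());
    }
    out
}

/// Same bytes from an unordered map: copy into an ordered map first.
pub fn serialize_vc_unordered(vc: &HashMap<u32, u64>) -> Vec<u8> {
    let sorted: BTreeMap<u32, u64> = vc.iter().map(|(k, v)| (*k, *v)).collect();
    serialize_vc(&sorted)
}
```

Iterating the HashMap directly instead of going through the ordered copy would still pass most single-language tests, which is why this bug tends to appear only in the cross-language run.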

db-16 — Broader Ideas

Where the primitives in this lab show up in real systems, and what to build on top of them in the rest of the distributed track.

Immediate next labs

db-17 — Raft. Reuses the deterministic simulator wholesale. Adds a Role ∈ {Follower, Candidate, Leader}, election timeouts, AppendEntries RPCs, and a commit index. Every step's safety argument ultimately reduces to "this state could not have been reached without a quorum acknowledgement", which is a happens-before statement on the log — exactly what vector clocks make precise.
db-18 — Paxos. Same harness, different message types (Prepare/Promise/Accept/Accepted). Paxos's invariants are notoriously hard to reason about by hand; a deterministic simulator that can replay a counterexample seed is the difference between "I think it's correct" and "I have evidence".
db-19 — ZAB. Adds a strict total order on broadcasts and a recovery phase. The Lamport clock generalizes to the ZAB epoch + counter pair.
db-20 — Distributed KV. Wraps a quorum-replicated key-value store around a chosen consensus engine. Now the simulator's "payload" is a client command, and the event log is auditable per-replica state.

How this lab's pieces map to real systems

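Lamport clocks are the simplest instance of a scalar timestamp that orders events. Related scalar orderings show up in Kafka's per-partition offsets, in Cassandra's per-cell write timestamps (wall-clock values used for last-write-wins rather than true logical clocks), and, loosely, in Spanner's commit timestamps, which come from TrueTime's bounded-uncertainty physical clock rather than a logical counter.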
Vector clocks are the kernel of Amazon Dynamo's conflict detection, Riak's siblings, and the CRDT literature's "stable causal delivery" layer.
Deterministic discrete-event simulation is how FoundationDB developed and continues to harden its storage and replication code (Will Wilson's Testing Distributed Systems with Deterministic Simulation talk at Strange Loop 2014 is the canonical reference). It is also how TigerBeetle, Polar Signals, and Antithesis test their production code paths.
The (time, sender, seq) heap tie-break is the same trick used by every event-loop sim from simpy to game-engine fixed-timestep loops.

Performance experiments worth running later

Crank --nodes and --rounds and plot wall time vs. event count for each language. With the current canonical serializer this should be linear in events; any quadratic growth means the wire format or the heap is doing something dumb.
Replace the unicast splitmix64 destination with a broadcast and measure the explosion in VC entries per receive (each broadcast forces every other node's VC to grow by one entry).
Try a HashMap-based VC in Rust and observe cross_test.sh failing. This is the cheapest possible lesson on why deterministic iteration order matters; do it once and you will never forget it.

What "production-quality" would require beyond this lab

A real network layer (TCP or QUIC), with retries, timeouts, and application-level acks rather than the simulator's deliver-and-forget.
Lossy / reordering channels and partition primitives. db-17 will add partitions as a Network::partition(a, b) toggle; this lab deliberately omits them so the determinism story is small.
Persistent storage for clocks (so a crash-restart doesn't replay Lamport from 0). The Raft lab in db-17 will need this; the WAL we built in db-03 is the obvious substrate.
Compact vector clocks (interval tree clocks, dotted version vectors) for systems with thousands of nodes or more; the naive map-based VC here becomes a bandwidth problem at that scale.

None of these change the shape of the primitives — they make the same primitives faster, leaner, and tolerant of real-world failures.

db-16 step 01 — Logical clocks

Goal

Implement Lamport and vector clocks as first-class types in all three languages, with identical semantics under a small, well-defined API.

Tasks

Lamport clock. A wrapper over a u64 counter exposing:
- tick() — bump the counter, return the new value.
- send() -> u64 — equivalent to tick: bump, then return the stamp for the outgoing message.
- recv(incoming: u64) — self = max(self, incoming) + 1.
- value() -> u64.
Vector clock. A wrapper over Map<u32, u64> exposing:
- tick(self_id) — vc[self_id] += 1.
- send(self_id) -> Self — bump own counter, return a clone of the full VC (the snapshot that gets stamped onto a message).
- recv(self_id, incoming: &Self) — pointwise vc[k] = max(vc[k], incoming[k]) for every key k in incoming, then bump the receiver's own counter.
- partial_cmp(other) -> {Less, Equal, Greater, Concurrent}. Pure function over the two maps.
Wire serialization for the VC. Entries on the wire MUST be sorted ascending by node_id. This is non-negotiable — it is the single biggest source of byte-diff bugs across languages.
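The Lamport half of the task list is small enough to sketch in full. This is one possible shape following the API above, not the lab's reference solution; the struct name is an assumption.

```rust
/// Lamport clock: a scalar counter that only moves forward.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct LamportClock {
    counter: u64,
}

impl LamportClock {
    /// Local event: bump the counter and return the new value.
    pub fn tick(&mut self) -> u64 {
        self.counter += 1;
        self.counter
    }

    /// Send: same as tick; the returned value is stamped onto the message.
    pub fn send(&mut self) -> u64 {
        self.tick()
    }

    /// Receive: jump past both the local value and the incoming stamp.
    pub fn recv(&mut self, incoming: u64) {
        self.counter = self.counter.max(incoming) + 1;
    }

    /// Current value, without advancing the clock.
    pub fn value(&self) -> u64 {
        self.counter
    }
}
```

Starting from zero, three ticks return 1, 2, 3, and a following recv(10) leaves the clock at 11.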

Acceptance

Inline unit tests in each language:

lamport_tick_monotonic — three ticks produce 1, 2, 3.
lamport_recv_jumps — recv(10) after value 3 yields 11.
vc_partial_order_less — {0:1} < {0:1, 1:1}.
vc_partial_order_concurrent — {0:2, 1:0} and {0:0, 1:2} are concurrent.
vc_recv_merges_then_ticks — recv(self=1, {0:5, 1:0}) from initial {1:2} yields {0:5, 1:3}.
vc_serialize_sorted — the bytes are identical no matter what order entries were inserted in the map.

All six green in Rust, Go, and C++.
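The vector-clock acceptance cases above can be checked against a sketch like the following. It assumes a BTreeMap-backed type and treats a missing key as 0 when comparing; the type names and the from_pairs helper are hypothetical.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq, Eq)]
pub enum VcOrder { Less, Equal, Greater, Concurrent }

#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct VectorClock {
    entries: BTreeMap<u32, u64>,
}

impl VectorClock {
    /// Test helper (hypothetical): build a clock from (node_id, counter) pairs.
    pub fn from_pairs(pairs: &[(u32, u64)]) -> Self {
        VectorClock { entries: pairs.iter().copied().collect() }
    }

    pub fn tick(&mut self, self_id: u32) {
        *self.entries.entry(self_id).or_insert(0) += 1;
    }

    /// Bump own counter, then return the snapshot to stamp on the message.
    pub fn send(&mut self, self_id: u32) -> VectorClock {
        self.tick(self_id);
        self.clone()
    }

    /// Pointwise max over the incoming keys, then bump the receiver's own counter.
    pub fn recv(&mut self, self_id: u32, incoming: &VectorClock) {
        for (&k, &v) in &incoming.entries {
            let slot = self.entries.entry(k).or_insert(0);
            *slot = (*slot).max(v);
        }
        self.tick(self_id);
    }

    fn get(&self, id: u32) -> u64 {
        self.entries.get(&id).copied().unwrap_or(0)
    }

    /// Pure comparison; a key missing from one side counts as 0.
    pub fn partial_cmp(&self, other: &VectorClock) -> VcOrder {
        let (mut less, mut greater) = (false, false);
        for k in self.entries.keys().chain(other.entries.keys()) {
            let (a, b) = (self.get(*k), other.get(*k));
            if a < b { less = true; }
            if a > b { greater = true; }
        }
        match (less, greater) {
            (false, false) => VcOrder::Equal,
            (true, false) => VcOrder::Less,
            (false, true) => VcOrder::Greater,
            (true, true) => VcOrder::Concurrent,
        }
    }
}
```

The Concurrent arm is the whole point of the four-valued result: it is reached exactly when each side is ahead of the other on at least one key.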

Discussion prompts

Why does recv bump the receiver's own counter after the merge rather than before?
Why is the "Concurrent" outcome of partial_cmp necessary; what goes wrong if you collapse it into Equal or Less?
For a system with one million nodes, is a map-keyed VC still practical? What data structures replace it (hint: interval tree clocks, dotted version vectors)?

db-16 step 02 — Deterministic simulator

Goal

Build a discrete-event simulator whose (seed, nodes, rounds) triple completely determines its event log, and produce a canonical serialization of that log.

Tasks

PRNG. Implement splitmix64 in each language with unsigned wrapping multiplication. Seed it per-decision with seed ^ (t << 32) ^ (s + 1) so that no shared mutable PRNG state crosses a (t, s) boundary. This eliminates the "whose turn is it to read the RNG?" ambiguity that bites every multi-language implementation.
Per-tick decision. For each (t < rounds, s ∈ 0..nodes), compute:
- dest_pre = (r & 0xFFFF) % (nodes - 1) then skip-self: dest = dest_pre + (1 if dest_pre >= s else 0).
- delay = 1 + ((r >> 16) & 0xFFFF) % 3.
- payload = (r >> 32) & 0xFF.
Scheduler. Maintain a min-heap of in-flight messages keyed on (delivery_time, sender_id, global_seq). global_seq is a single monotonic counter incremented every time a message is enqueued. This guarantees a total order even when two messages have identical (delivery_time, sender_id).
Tick loop. For t in 0 .. rounds + MAX_DELAY:
1. Drain all heap entries with delivery_time == t: for each, run recv on the destination node, emit a Recv event.
2. If t < rounds: for each s in 0..nodes, compute the decision, enqueue the message, run send on the sender, emit a Send event.
Wire format. As documented in CONCEPTS.md. Magic "DSE6", u32 LE event count, then event_count events. Each event uses little-endian integers and serializes its vector clock with entries sorted ascending by node id.

Acceptance

Inline unit tests:

splitmix64_known_values — for seed=0, the first three outputs are 0xE220A8397B1DCDAF, 0x6E789E6AA1B965F4, 0x06C45D188009454F.
sim_deterministic_one_node — --nodes 2 --rounds 3 --seed 1 produces a fixed event count and a fixed first-event byte sequence.
sim_event_count_formula — for any (nodes ≥ 2, rounds ≥ 1), total events = 2 * nodes * rounds (every send becomes exactly one recv).
causality_holds — after running simulate(...), walk the event log: every Recv from peer has a strictly-greater VC than the paired Send.
byte_round_trip — serializing the same event log twice yields identical bytes (no nondeterminism in the serializer itself).

All five green in Rust, Go, and C++.
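The scheduler's ordering rule is the part most likely to differ silently between languages, so it is worth isolating. A minimal sketch using the standard library's BinaryHeap (a max-heap, hence the Reverse wrapper); the InFlight and Scheduler names are assumptions, and the field order in the struct is what makes the derived ordering match (delivery_time, sender, seq).

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// In-flight message. Field order matters: the derived Ord compares
/// delivery_time first, then sender, then seq.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct InFlight {
    pub delivery_time: u64,
    pub sender: u32,
    pub seq: u64,
    pub dest: u32,
    pub payload: u8,
}

#[derive(Default)]
pub struct Scheduler {
    heap: BinaryHeap<Reverse<InFlight>>,
    next_seq: u64, // single global counter, bumped on every enqueue
}

impl Scheduler {
    pub fn enqueue(&mut self, delivery_time: u64, sender: u32, dest: u32, payload: u8) {
        let seq = self.next_seq;
        self.next_seq += 1;
        self.heap.push(Reverse(InFlight { delivery_time, sender, seq, dest, payload }));
    }

    /// Pop every message due at tick `t`, in (delivery_time, sender, seq) order.
    pub fn drain_due(&mut self, t: u64) -> Vec<InFlight> {
        let mut due = Vec::new();
        while let Some(Reverse(top)) = self.heap.peek() {
            if top.delivery_time != t {
                break;
            }
            due.push(self.heap.pop().unwrap().0);
        }
        due
    }
}
```

Since seq is unique, no two heap entries ever compare equal, so the pop order does not depend on the heap's internal layout.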

Discussion prompts

Why deliver before send within a single tick?
What breaks if global_seq is per-sender instead of global?
The simulator never drops or reorders messages beyond delay-based interleaving. What new wire-format field would --drop-rate p need, and would it break the cross-language hash if defaulted to 0?

db-16 step 03 — CLI and cross-language byte-identity

Goal

Build a simctl CLI in all three languages, then prove via sha256 that all three produce byte-identical event logs for the same (seed, nodes, rounds) triple — for at least two distinct scenarios.

CLI contract

simctl --seed N --nodes K --rounds R

Writes the canonical wire-format bytes (no trailing newline) to stdout.

Tasks

Build simctl in Rust (src/rust/src/bin/simctl.rs), Go (src/go/cmd/simctl/main.go), and C++ (src/cpp/src/simctl.cc).
Write scripts/verify.sh that runs unit tests in all three langs.
Write scripts/cross_test.sh that:
1. Builds all three binaries.
2. Scenario A: simctl --seed 42 --nodes 3 --rounds 20 → sha256 all three outputs → assert all three match.
3. Scenario B: simctl --seed 7 --nodes 5 --rounds 50 → sha256 all three → assert all three match.
4. Spot-check the first 8 bytes of scenario A's output equal the magic "DSE6" plus the u32 LE count 120.
5. Print === ALL OK ===.

Acceptance

$ scripts/verify.sh
=== rust === ... ok
=== go   === ... ok
=== cpp  === ... ok
=== OK ===

$ scripts/cross_test.sh
...
  match(A): 0d7e753cdc891e3a481977da372a4d97a6a0e0ab00b74f5a074dbc25791dc797
  match(B): 321221187709684afd59c55202f8d373dad33c8026e933b36740aeed23c8c2d4
=== ALL OK ===

A byte-identical hash across three independent implementations is strong evidence that the PRNG, scheduler, clock-update rules, and wire format all follow the same spec. Any divergence, even in a single byte, will surface here.

Discussion prompts

Why two scenarios instead of one? What property would slip through with a single scenario that two catch?
If the scenario-A hash matches but scenario B does not, where in the codebase would you start looking?
The sha256 hashes are baked into the script as constants. What's the benefit, and what's the maintenance cost when the wire format legitimately evolves (e.g., adding a new event kind)?

db-17 — Raft

This lab implements Raft consensus in Rust, Go, and C++, all three producing a byte-identical sha256 of a canonical cluster dump for any (seed, nodes, rounds, proposals, partition) configuration. It builds directly on the deterministic-simulator discipline from db-16: same splitmix64 seeding, same (delivery_time, sender, seq) heap tie-break, same "sorted iteration on the wire" rule.

If db-16 taught you to keep an event log bit-stable across three languages, db-17 teaches you to keep an entire replicated state machine's persistent state bit-stable across three languages and across network partitions. Every later distributed lab (db-18 Paxos, db-19 ZAB, db-20 distributed-kv) is a variation on this skeleton.

What is it?

Raft (Ongaro & Ousterhout, USENIX ATC 2014) is a consensus algorithm that keeps an ordered, append-only replicated log consistent across a cluster of nodes despite crashes, message reorderings, and arbitrary partitions of the network. It is the consensus core inside etcd, Consul, TiKV, CockroachDB, MongoDB's metadata, and many more.

Raft decomposes consensus into three sub-problems:

Leader election. Each node is one of {Follower, Candidate, Leader}. Followers run an election timeout; on timeout a follower becomes a candidate, bumps its current_term, votes for itself, and broadcasts RequestVote. A candidate that receives a majority of vote_granted=true replies in the same term becomes leader.
Log replication. The leader accepts client proposals and appends them to its log. It broadcasts AppendEntries RPCs carrying the new entries plus a prev_log_index / prev_log_term consistency check. On a mismatch the follower rejects; the leader decrements next_index[follower] and retries. Once a majority's match_index covers entry N and log[N].term == current_term, the leader advances commit_index to N.
Safety. Election restriction (a candidate only earns a vote if its log is at least as up-to-date as the voter's), the commit-only-current-term rule, and the log-matching property (identical (index, term) ⇒ identical entries) together imply state machine safety: once an entry at index i is applied at one node, no other node will ever apply a different entry at i.

This lab implements the algorithm as it appears in Figure 2 of the paper, minus snapshots and minus membership changes. The simulator drives sim time forward in integer ticks; messages are scheduled into a heap with a deterministic (delivery_time, sender, seq) order; an optional partition set drops messages in one direction between named pairs.

Why does it matter?

Raft is the production consensus algorithm of the 2010s. Knowing exactly how prev_log_index works, why commit advance is gated on log[N].term == current_term, and why the election restriction exists is the difference between operating etcd and understanding etcd.
Three byte-identical implementations force the spec to be unambiguous. Anywhere Raft "depends on the implementation" (RPC scheduling, election timer jitter, tiebreak for "which leader gets a proposal", iteration order of peer ids) has to be pinned down. The cross-language sha256 makes drift loud.
Reproducible partitions. With a deterministic --partition s,d,... flag and a seeded simulator, you can replay the exact sequence of message drops, leadership changes, and committed entries that triggered a bug, on any machine, in any of the three languages.
Foundation for the rest of the track. db-18 Paxos and db-19 ZAB will reuse the simulator harness; db-20 distributed-kv will plug a consensus engine into a real key-value store.

How does it work?

State (per node)

persistent  : current_term : u64
              voted_for    : Option<u32>          # None == -1 on the wire
              log          : Vec<LogEntry>        # 1-indexed in Figure 2; 0-indexed here
volatile    : role         : Follower | Candidate | Leader
              commit_index : u64                  # highest log index known committed
              last_applied : u64                  # we apply lazily; rarely diverges from commit_index
leader-only : next_index   : Map<peer_id, u64>    # index of next entry to send to each peer
              match_index  : Map<peer_id, u64>    # highest entry known replicated on each peer
timers      : election_deadline : u64             # sim-time tick
              heartbeat_due     : u64             # next time leader must send AE

Election timer

reset_election_timer(t):
    election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150

A 150-tick base plus up to 149 ticks of seeded jitter (the % 150 term) spreads election deadlines across nodes and makes the classic split-vote loop unlikely. Heartbeats fire every 50 ticks.

RequestVote handling

on RequestVote(term, candidate, last_log_index, last_log_term):
    if term > current_term:                # newer term seen
        become_follower(term)
    grant = (term == current_term)
         && (voted_for is None or voted_for == candidate)
         && candidate_log_is_at_least_as_up_to_date()
    if grant:
        voted_for = candidate
        reset_election_timer()
    reply RequestVoteReply(current_term, grant)

Up-to-date is defined as: last_log_term > my_last_term, or (last_log_term == my_last_term && last_log_index >= my_last_index).
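The up-to-date rule is a lexicographic comparison on (last_log_term, last_log_index). A small sketch showing the spelled-out form next to the tuple form; both function names are hypothetical.

```rust
/// The election restriction exactly as stated in the text.
pub fn up_to_date_spelled_out(
    cand_last_term: u64,
    cand_last_index: u64,
    my_last_term: u64,
    my_last_index: u64,
) -> bool {
    cand_last_term > my_last_term
        || (cand_last_term == my_last_term && cand_last_index >= my_last_index)
}

/// Equivalent form: Rust tuples compare field by field, left to right.
pub fn up_to_date(
    cand_last_term: u64,
    cand_last_index: u64,
    my_last_term: u64,
    my_last_index: u64,
) -> bool {
    (cand_last_term, cand_last_index) >= (my_last_term, my_last_index)
}
```

Note that a shorter log with a higher last term wins over a longer log with a lower last term; log length only matters as a tie-break.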

AppendEntries handling

on AppendEntries(term, leader, prev_idx, prev_term, entries, leader_commit):
    if term > current_term: become_follower(term)
    if term < current_term: reply (current_term, false); return
    reset_election_timer()
    if prev_idx > 0 && (log too short OR log[prev_idx-1].term != prev_term):
        reply (current_term, false); return        # consistency mismatch
    # truncate any conflicting suffix, then append
    for (i, e) in enumerate(entries):
        idx = prev_idx + i
        if idx < log.len() && log[idx].term != e.term:
            log.truncate(idx)
        if idx >= log.len():
            log.push(e)
    if leader_commit > commit_index:
        commit_index = min(leader_commit, log.len())
    reply (current_term, true, match_index = prev_idx + len(entries))
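
The log-reconciliation core of the handler above (consistency check, truncate on conflict, append) can be sketched as a standalone function. Term handling, the timer reset, and the commit_index update are left out; the names are assumptions, and prev_idx follows the pseudocode's convention of counting the entries that precede the new ones.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LogEntry {
    pub term: u64,
    pub command: Vec<u8>,
}

/// Returns Some(match_index) on success, None on a consistency mismatch.
pub fn reconcile_log(
    log: &mut Vec<LogEntry>,
    prev_idx: u64,
    prev_term: u64,
    entries: &[LogEntry],
) -> Option<u64> {
    let prev = prev_idx as usize;
    // Reject if we lack the entry before the new ones, or its term differs.
    if prev > 0 && (log.len() < prev || log[prev - 1].term != prev_term) {
        return None;
    }
    for (i, e) in entries.iter().enumerate() {
        let idx = prev + i; // 0-based slot for this entry
        if idx < log.len() && log[idx].term != e.term {
            log.truncate(idx); // drop the conflicting suffix
        }
        if idx >= log.len() {
            log.push(e.clone());
        }
    }
    Some(prev_idx + entries.len() as u64)
}
```

Re-delivering the same AppendEntries is a no-op: matching entries are neither truncated nor appended again, which is what makes retries safe.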

Commit advance (leader only)

advance_commit():
    for N in (commit_index + 1 ..= log.len()).rev():   # highest candidate index first
        if log[N-1].term != current_term: continue   # Figure 8 safety
        replicated = 1 + count(p != self : match_index[p] >= N)   # the 1 is the leader itself
        if 2 * replicated > nodes:
            commit_index = N; break

Propose (leader only)

propose(cmd):
    log.push(LogEntry{ term: current_term, command: cmd })
    match_index[self] = log.len()
    broadcast_append_entries()
    advance_commit()           # NB: required for n == 1, harmless for n > 1

The advance_commit() call inside propose is the one non-obvious detail. In a single-node cluster the leader has no peers, so no AppendEntriesReply will ever arrive to trigger a commit — but a majority is already satisfied (the leader alone is the majority). All three implementations call advance_commit() at the end of propose for byte-identical behaviour.
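The commit rule and its single-node edge case can be exercised with a pure function. This is a sketch under two assumptions: the log is passed as a slice of terms, and peer_match holds match_index for peers other than the leader (the leader counts itself separately). The function name and signature are hypothetical.

```rust
/// Returns the new commit_index for a leader.
/// `log_terms[i]` is the term of the entry at 1-based index i + 1.
/// `peer_match` holds match_index for every peer EXCEPT the leader.
pub fn advance_commit(
    log_terms: &[u64],
    current_term: u64,
    commit_index: u64,
    peer_match: &[u64],
    nodes: usize,
) -> u64 {
    let last = log_terms.len() as u64;
    // Scan from the highest index down; the first index that qualifies wins.
    for n in ((commit_index + 1)..=last).rev() {
        if log_terms[(n - 1) as usize] != current_term {
            continue; // never commit an older-term entry by counting replicas
        }
        let replicated = 1 + peer_match.iter().filter(|&&m| m >= n).count();
        if 2 * replicated > nodes {
            return n;
        }
    }
    commit_index
}
```

With nodes == 1 and an empty peer_match, replicated is 1 and 2 * 1 > 1 holds, so the newest current-term entry commits immediately, which is the behaviour scenario D depends on.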

Simulator loop (per tick `t in 0..rounds`)

1. enqueue scheduled proposals : if t == schedule[i], push payload onto pending
2. inject pending into leader  : pick (max term, min id) among Leaders; call propose
3. deliver in-flight           : pop heap entries with delivery_time == t
4. tick all nodes              : iterate in ascending id; on_tick may fire election or heartbeat

Proposal schedule: schedule[i] = (i+1) * rounds / (K+1) for i in 0..K (integer division), where K is the number of proposals (the --proposals value, not the node count). Deterministic, evenly spread, and independent of cluster behaviour.
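The closed-form schedule is a one-liner. A minimal sketch; the function name is hypothetical.

```rust
/// Ticks at which proposals are enqueued: (i + 1) * rounds / (proposals + 1),
/// using integer division, for i in 0..proposals.
pub fn proposal_schedule(rounds: u64, proposals: u64) -> Vec<u64> {
    (0..proposals)
        .map(|i| (i + 1) * rounds / (proposals + 1))
        .collect()
}
```

For rounds = 1000 and 5 proposals this yields 166, 333, 500, 666, 833. Multiplying before dividing matters: (i + 1) * (rounds / (proposals + 1)) would give 166, 332, 498, 664, 830 instead, which is exactly the kind of drift the cross-language hash catches.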

Wire format (Rpc)

Four variants; all field widths fixed; little-endian:

RequestVote       { term: u64, candidate: u32, last_log_index: u64, last_log_term: u64 }
RequestVoteReply  { term: u64, granted: bool (u8) }
AppendEntries     { term: u64, leader: u32, prev_idx: u64, prev_term: u64,
                    entries: [LogEntry], leader_commit: u64 }
AppendEntriesReply{ term: u64, success: bool (u8), match_index: u64 }

The wire format is not serialized to disk by this lab — the simulator passes Rpcs as typed structs in memory. Only the canonical dump is serialized, and that is what gets hashed.

Canonical dump format

file := magic[8 = "DSERAFT1"] u32_le(node_count) node*

node := u32_le id
        u64_le current_term
        i64_le voted_for                # -1 if None (two's complement little-endian)
        u8     role                     # Follower=0, Candidate=1, Leader=2
        u64_le commit_index
        u32_le log_len
        entry * log_len

entry := u64_le term
         u32_le cmd_len
         u8 cmd[cmd_len]

Nodes appear in ascending id order. All multi-byte numbers are little-endian. The dump is hashed with SHA-256; the lowercase hex digest is what raftctl prints (no trailing newline).
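The grammar above maps onto a short serializer. This is a sketch of the layout as documented here, with hypothetical struct names; it produces the raw dump only and does not hash it.

```rust
#[derive(Clone, Copy)]
pub enum Role { Follower = 0, Candidate = 1, Leader = 2 }

pub struct Entry { pub term: u64, pub cmd: Vec<u8> }

pub struct NodeState {
    pub id: u32,
    pub current_term: u64,
    pub voted_for: Option<u32>,
    pub role: Role,
    pub commit_index: u64,
    pub log: Vec<Entry>,
}

/// Serialize the cluster: magic, node count, then nodes in ascending id order.
pub fn canonical_dump(nodes: &[NodeState]) -> Vec<u8> {
    let mut sorted: Vec<&NodeState> = nodes.iter().collect();
    sorted.sort_by_key(|n| n.id);

    let mut out = Vec::new();
    out.extend_from_slice(b"DSERAFT1");
    out.extend_from_slice(&(sorted.len() as u32).to_le_bytes());
    for n in sorted {
        out.extend_from_slice(&n.id.to_le_bytes());
        out.extend_from_slice(&n.current_term.to_le_bytes());
        // None is encoded as i64 -1, i.e. eight 0xFF bytes.
        let voted: i64 = n.voted_for.map_or(-1, i64::from);
        out.extend_from_slice(&voted.to_le_bytes());
        out.push(n.role as u8);
        out.extend_from_slice(&n.commit_index.to_le_bytes());
        out.extend_from_slice(&(n.log.len() as u32).to_le_bytes());
        for e in &n.log {
            out.extend_from_slice(&e.term.to_le_bytes());
            out.extend_from_slice(&(e.cmd.len() as u32).to_le_bytes());
            out.extend_from_slice(&e.cmd);
        }
    }
    out
}
```

A node with an empty log occupies 33 bytes (4 + 8 + 8 + 1 + 8 + 4), so a dump's size can be sanity-checked by hand before looking at the hash.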

Cross-language invariants

Invariant	Why it matters
splitmix64 constants `0x9E3779B97F4A7C15`, `0xBF58476D1CE4E7B5`, `0x94D049BB133111EB`	identical PRNG output
`election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150`	identical election firing times
`delivery_delay = 1 + splitmix64(seed ^ src ^ dst ^ t) % 3`	identical message scheduling
heap order `(delivery_time, sender, seq)`; `seq` global monotonic	identical delivery sequence
peers iterated in ascending id (`BTreeMap` / `std::map` / explicit `for p:=0;p<n;p++`)	identical broadcast order
leader-pick for proposal injection: `(max term, min id)`	identical client routing
proposal schedule: `(i+1) * rounds / (K+1)` integer division	identical pending queue contents
`propose()` calls `advance_commit()`	identical commit_index for n=1
`voted_for = None` encoded as i64 LE `-1`	identical dump bytes
`Role` enum order `Follower=0, Candidate=1, Leader=2`	identical dump bytes

If any one of these drifts, scripts/cross_test.sh will fail and cmp -l on the two raw dumps will print the byte offset of the first divergence.

Files

src/rust/ — raft17 crate + raftctl binary.
src/go/ — module github.com/10xdev/dse/db17 + cmd/raftctl.
src/cpp/ — db17_lib static library + raftctl binary + test_db17.
scripts/verify.sh — runs the unit tests for all three.
scripts/cross_test.sh — proves the three binaries produce byte-identical canonical dumps for six seeded scenarios.

See docs/ for the long-form write-up and steps/ for the staged implementation path.

db-17 — References

Primary sources

Diego Ongaro and John Ousterhout, In Search of an Understandable Consensus Algorithm (Extended Version), USENIX ATC 2014. The Raft paper. Figure 2 is the spec this lab implements; Figure 8 is the motivation for the "commit only entries of the current term" rule. https://raft.github.io/raft.pdf
Diego Ongaro, Consensus: Bridging Theory and Practice, Stanford PhD dissertation, 2014. The book-length treatment. Chapter 3 covers what's in this lab; chapters 4 and 5 cover membership changes and log compaction, including snapshots (deferred to db-21 / db-23). https://github.com/ongardie/dissertation
raft-tla — the TLA+ specification of the algorithm, also by Ongaro. Useful when you want a second, machine-checked statement of the same rules implemented here. https://github.com/ongardie/raft.tla

Implementations to read alongside

etcd/raft (Go) — the most-studied production Raft. Same Figure 2 structure; adds pre-vote, leader leases, learner replicas, ReadIndex, joint consensus. https://github.com/etcd-io/raft
hashicorp/raft (Go) — Consul's engine. Easier to read than etcd's because it carries less production scar tissue. https://github.com/hashicorp/raft
tikv/raft-rs (Rust) — TiKV's port of etcd's algorithm. Useful as a counterpoint to this lab's stdlib-only Rust version. https://github.com/tikv/raft-rs

Determinism and simulation

db-16's references on FoundationDB simulation testing and TigerBeetle apply verbatim here.
Hermitian (CockroachDB) and Antithesis are commercial deterministic simulators for distributed databases; the spirit is the same as cross_test.sh.

Background reading worth doing

Heidi Howard et al., Flexible Paxos: Quorum intersection revisited, OPODIS 2016. Helps see Raft as a specialization of Paxos with a fixed quorum intersection rule.
Lamport's Paxos Made Simple — for the db-18 transition.
André Medeiros, ZooKeeper's Atomic Broadcast Protocol: Theory and Practice (2012), for the db-19 transition.

Cross-lab dependencies

Upstream: db-16 distributed-fundamentals (Lamport/VC and the deterministic simulator harness whose discipline this lab inherits wholesale).
Downstream:
- db-18 Paxos — reuses the heap-and-tick simulator; different RPC structure; weaker leader assumption.
- db-19 ZAB — leader-based atomic broadcast; same election + log-replication skeleton.
- db-20 Distributed KV — wraps a chosen consensus engine (probably this one) around a key-value state machine.
- db-23 Capstone — joint membership changes and snapshots get added on top of this code.

db-17 — Analysis

Required invariants

Election safety. At most one leader per term. Enforced by majority voting: a candidate only becomes leader after collecting votes from a strict majority, and each voter only grants one vote per term (the voted_for field, persisted in the canonical dump).
Leader append-only. A leader never overwrites or deletes entries from its own log; it only appends. Followers may truncate on an AppendEntries consistency mismatch, but the leader's local log only grows.
Log matching property. If two logs contain an entry with the same (index, term), then the logs are identical in all entries up through that index. Enforced by the prev_log_index / prev_log_term check in AppendEntries and the truncate-on-conflict rule.
Leader completeness. If an entry is committed in term T, that entry is present in the log of every leader for all later terms. Enforced by the election restriction (a vote is only granted if the candidate's log is at least as up-to-date as the voter's).
State machine safety. If a node has applied an entry at index i, no other node will ever apply a different entry at i. This follows from log matching + leader completeness + the commit-only-current-term rule.
Byte determinism. For every (seed, nodes, rounds, proposals, partition) tuple, the three binaries produce identical canonical_dump bytes — hence identical sha256 hex on stdout. scripts/cross_test.sh checks six scenarios.

Design decisions

propose() calls advance_commit() at the end. The non-obvious one. In a single-node cluster the "leader" has no peers, so no AppendEntriesReply will ever arrive to drive advance_commit(). But a single-node cluster is its own majority, so the entry should commit the moment it is appended. Without this call, scenario D (--nodes 1) ends with commit_index = 0 despite five proposals in the log. Calling advance_commit() is harmless for n > 1 (the loop's majority check rejects until replies actually arrive).
Sorted iteration on every wire-affecting loop. Rust uses BTreeMap<u32, u64> for next_index / match_index; C++ uses std::map; Go uses explicit for p := uint32(0); p < n; p++ loops. HashMap would compile and pass single-language tests but fail cross_test.sh immediately. db-16's analysis.md called this out; db-17 enforces it across more code surface.
In-flight heap ordered by (delivery_time, sender, seq). seq is a global monotonic counter incremented every time a message is enqueued. It exists only to break ties when two messages with the same (delivery_time, sender) would otherwise be ambiguously ordered. Without seq you would see byte diffs on dense traffic at the same delivery tick.
Leader-pick for proposal injection is (max term, min id) among role == Leader nodes. During leadership churn there may be no leader, or there may be multiple stale leaders that have not yet stepped down. The (max term, min id) rule produces a deterministic routing decision no matter which language's iteration order you start from.
Proposal schedule is closed-form. schedule[i] = (i+1) * rounds / (K+1) (integer division). This places K proposals evenly through the rounds window, independent of cluster behaviour. A schedule derived from cluster state ("propose whenever there's a leader") would couple proposal timing to incidental scheduling choices and produce noisy byte diffs.
Splitmix64 constants are explicit. 0x9E3779B97F4A7C15 (the golden-ratio increment, often written γ, added to the state on every step), 0xBF58476D1CE4E7B5 and 0x94D049BB133111EB (the two multipliers of the output mixer in Vigna's reference implementation). All three implementations copy them as literals; nobody computes them.
Library + thin CLI. The lab exposes Cluster::new, run, canonical_dump, and sha256 as a library. The CLI is a few dozen lines of arg parsing plus four function calls. Downstream labs (db-18 Paxos, db-20 distributed-kv) will link the library, not shell out.
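Of the decisions above, the leader-pick rule is the easiest to get subtly wrong, because "max term, min id" mixes two sort directions. A minimal sketch with hypothetical type names:

```rust
use std::cmp::Reverse;

#[derive(Clone, Copy, PartialEq, Eq)]
pub enum Role { Follower, Candidate, Leader }

pub struct NodeView {
    pub id: u32,
    pub term: u64,
    pub role: Role,
}

/// Pick the node that receives the next proposal: among nodes that currently
/// believe they are Leader, highest term wins; ties go to the lowest id.
pub fn pick_leader(nodes: &[NodeView]) -> Option<u32> {
    nodes
        .iter()
        .filter(|n| n.role == Role::Leader)
        // Reverse(id) makes the smallest id the largest key within a term.
        .max_by_key(|n| (n.term, Reverse(n.id)))
        .map(|n| n.id)
}
```

Because the key (term, Reverse(id)) is unique per node, the result does not depend on the order in which nodes are visited, which is the property the cross-language test needs.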

Tradeoffs worth flagging

No snapshots, no log compaction. Logs grow without bound across the run, but the canonical scenarios stay small: --rounds 2000 --proposals 20 ends with about 20 entries per node. Production Raft adds an InstallSnapshot RPC carrying last_included_index / last_included_term; deferred to db-21 (storage-engine-advanced).
No pre-vote, no leader lease. A network-partitioned candidate will repeatedly increment its term, then on heal will force the legitimate leader to step down. Mitigated by tight election timeouts in this simulator but a real cluster needs the pre-vote optimization (Ongaro thesis §9.6).
No membership changes. The node count is fixed at Cluster::new time. Joint consensus (and the safer learner-then-promote alternative) is a major chapter on its own; deferred to db-23 capstone.
Crash semantics are stylized. Crashes are simulated only via the partition flag (drop all messages in one direction). A real Raft must handle persistent storage corruption, fsync ordering, and restart-mid-vote; the canonical dump pretends all state is durable by construction.
No client-side dedup. A proposal injected into a leader that immediately loses leadership may or may not end up replicated; if it is lost, nothing re-proposes it. The simulator's pending queue is drained unconditionally; we are testing the consensus core, not the client RPC layer.

Why three languages

Same reasoning as db-16, plus one new lesson: Raft has many places where "iterate over peers" appears. Each one is a chance for a byte diff. C++'s std::map and Rust's BTreeMap are ordered by default; Go's map is explicitly randomized at iteration time. The Go implementation has explicit for p := uint32(0); p < n; p++ loops everywhere a peer iteration appears. Discovering this discipline by forcing the cross-language test to pass is more durable than reading "don't use HashMap" in a style guide.

db-17 — Execution

One-shot: prove the lab works

cd db-17-raft
./scripts/verify.sh        # all unit tests in Rust, Go, C++
./scripts/cross_test.sh    # byte-identical sha256 across all three, six scenarios

A green run of cross_test.sh ends with the literal line:

=== ALL OK ===

Per-language workflows

Rust

cd src/rust
cargo test --release       # ~10 tests
cargo build --release      # produces target/release/raftctl
./target/release/raftctl --seed 42 --nodes 3 --rounds 1000 --proposals 5

Go

cd src/go
go test ./...              # ~12 tests
go build -o /tmp/raftctl_go ./cmd/raftctl
/tmp/raftctl_go --seed 42 --nodes 3 --rounds 1000 --proposals 5

C++

cd src/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
ctest --test-dir build   # test_db17 → "db-17 C++ tests: 10 passed"
./build/raftctl --seed 42 --nodes 3 --rounds 1000 --proposals 5

CLI

All three binaries accept the same flags and print lowercase hex sha256 of the canonical dump to stdout with no trailing newline:

flag	default	meaning
`--seed N`	0	splitmix64 seed mixed into election timers and message delays
`--nodes K`	3	number of Raft nodes (1 is legal; majority is then 1)
`--rounds R`	1000	number of simulator ticks to run
`--proposals P`	0	number of client commands to inject during the run
`--partition s,d,...`	none	comma-separated pairs `(src, dst)` to drop in that direction

--partition 0,1,1,0 drops both directions between nodes 0 and 1 (complete split); --partition 0,1 drops only 0 → 1 (asymmetric). Proposals are spaced as schedule[i] = (i+1) * rounds / (K+1), where K here is the --proposals value (not --nodes); with --rounds 1000 --proposals 5 they fire at ticks 166, 333, 500, 666, 833.
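The --partition argument is a flat list read as consecutive (src, dst) pairs. A sketch of one way to parse it into a directed drop set; the function names and error messages are hypothetical and the real raftctl may validate differently.

```rust
use std::collections::BTreeSet;

/// Parse "s,d,s,d,..." into a set of directed (src, dst) pairs to drop.
pub fn parse_partition(arg: &str) -> Result<BTreeSet<(u32, u32)>, String> {
    if arg.is_empty() {
        return Ok(BTreeSet::new());
    }
    let ids: Vec<u32> = arg
        .split(',')
        .map(|tok| {
            tok.trim()
                .parse::<u32>()
                .map_err(|e| format!("bad node id {tok:?}: {e}"))
        })
        .collect::<Result<_, _>>()?;
    if ids.len() % 2 != 0 {
        return Err("--partition needs an even number of node ids".to_string());
    }
    Ok(ids.chunks(2).map(|p| (p[0], p[1])).collect())
}

/// True if a message from `src` to `dst` should be dropped.
pub fn is_dropped(blocked: &BTreeSet<(u32, u32)>, src: u32, dst: u32) -> bool {
    blocked.contains(&(src, dst))
}
```

Keeping the pairs directed is what lets 0,1 model an asymmetric link while 0,1,1,0 models a full split. A BTreeSet is used so that any later iteration over the set is in a fixed order.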

Canonical scenarios

scripts/cross_test.sh runs six scenarios; their sha256s are listed in docs/observation.md. If any change, cross_test.sh will diff the raw dumps and exit non-zero.

label	args
A	`--seed 42 --nodes 3 --rounds 1000 --proposals 5`
B	`--seed 7 --nodes 5 --rounds 2000 --proposals 20`
C	`--seed 99 --nodes 3 --rounds 500 --proposals 0`
D	`--seed 1 --nodes 1 --rounds 200 --proposals 5`
E	`--seed 42 --nodes 3 --rounds 1000 --proposals 3 --partition 0,1,0,2,1,0,2,0`
F	`--seed 3 --nodes 5 --rounds 1500 --proposals 10 --partition 0,1`

D exercises the single-node-leader code path that motivated the propose() → advance_commit() call. E isolates node 0 completely; the other two must elect a leader and commit the remaining proposals. F is an asymmetric partition (only 0 → 1 is dropped) that causes term churn while replication still recovers.

Sanity checks

# magic bytes of the canonical dump (use the lib directly; the CLI hashes it)
cat <<'EOF' | cargo run --quiet --example dump_magic
EOF
# or just trust the test: TestCanonicalDumpMagic in raft_test.go
# or for C++:   test_db17 prints "canonical dump magic OK" among its asserts

# pick any scenario and round-trip:
./src/rust/target/release/raftctl --seed 42 --nodes 3 --rounds 1000 --proposals 5
# expect:  a2299ff06a2ed5ced5842d100bb7867b3ae50f6e7d7da93f835385565f1ed9e9

db-17 — Observation

What the cross-language test produces and how to read it by hand.

Expected sha256s

scripts/cross_test.sh runs six scenarios and asserts the three binaries (Rust, Go, C++) all print the same hex digest. The current canonical hashes are:

label	args	sha256
A	`--seed 42 --nodes 3 --rounds 1000 --proposals 5`	`a2299ff06a2ed5ced5842d100bb7867b3ae50f6e7d7da93f835385565f1ed9e9`
B	`--seed 7 --nodes 5 --rounds 2000 --proposals 20`	`b6dc06aee72e595f51bd5045ea7c92ffcbe7f6fda3198985f9ded1eca2671c4b`
C	`--seed 99 --nodes 3 --rounds 500 --proposals 0`	`f9db9ea7e6c1ca2b3a911b42b2431e964a4ee7c5e40e27efd29b41e747958838`
D	`--seed 1 --nodes 1 --rounds 200 --proposals 5`	`ce8b8e05d6ad0b4a243753a934b2f052c2363e97beca0c175586677d1a489408`
E	`--seed 42 --nodes 3 --rounds 1000 --proposals 3 --partition 0,1,0,2,1,0,2,0`	`b1689eb48b209187b7cd82a24b1a6a2d19b0be4b481ac1a5b4f1ac9e23a6ae05`
F	`--seed 3 --nodes 5 --rounds 1500 --proposals 10 --partition 0,1`	`fcc70ecabe37509133bb27155f5bd7d74981c3f98e79719e2b47077acca6a31f`

If any of these change, cross_test.sh will fail; either you have a bug, or you have intentionally changed the spec (timer constants, schedule formula, dump layout) and you must update this table in the same commit.

What the canonical dump looks like (scenario D — single node)

--seed 1 --nodes 1 --rounds 200 --proposals 5. Five proposals into a single-node cluster — leader is itself the majority, so every proposal commits immediately.

offset 0x00 :  44 53 45 52 41 46 54 31    "DSERAFT1"        magic
offset 0x08 :  01 00 00 00                 1                 node_count
offset 0x0c :  00 00 00 00                 0                 node id
offset 0x10 :  ?? ?? ?? ?? ?? ?? ?? ??     current_term      (~1, the first self-election)
offset 0x18 :  00 00 00 00 00 00 00 00     voted_for = 0     (voted for self in term 1)
offset 0x20 :  02                          role = Leader (2)
offset 0x21 :  05 00 00 00 00 00 00 00     commit_index = 5
offset 0x29 :  05 00 00 00                 log_len = 5
offset 0x2d :  XX XX XX XX XX XX XX XX     log[0].term       (== current_term)
offset 0x35 :  03 00 00 00                 log[0].cmd_len    (3 bytes: "p00")
offset 0x39 :  70 30 30                    "p00"             payload
...

Each entry is 8 + 4 + 3 = 15 bytes (term + cmd_len + "pNN"). The total dump for D is therefore 0x2d + 5 * 15 = 45 + 75 = 120 bytes (0x78). The size is fixed by the layout; only the term values (current_term and each entry's term) depend on the election history that --seed 1 produces before the first self-vote.

A multi-node dump (scenario C — quiet cluster)

--seed 99 --nodes 3 --rounds 500 --proposals 0. No proposals; the cluster elects a leader, sends heartbeats, and that is it. Every node's log is empty:

44 53 45 52 41 46 54 31         magic
03 00 00 00                     node_count = 3

00 00 00 00                     node id 0
XX XX XX XX XX XX XX XX         current_term       (1 if the first election succeeded, otherwise higher)
XX XX XX XX XX XX XX XX         voted_for           (0 for the leader, otherwise the leader id)
XX                              role                (Leader or Follower; never Candidate at quiescence)
00 00 00 00 00 00 00 00         commit_index = 0
00 00 00 00                     log_len = 0

01 00 00 00                     node id 1
... same shape ...

02 00 00 00                     node id 2
... same shape ...

Total dump: 8 + 4 + 3 * (4 + 8 + 8 + 1 + 8 + 4) = 111 bytes.
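The layout above can be written down as an encoder to check the byte arithmetic. This is a sketch under stated assumptions: the struct and field names (`NodeState`, `LogEntry`) are illustrative rather than the lab's real types, and the field order and widths are taken from the two annotated dumps in this section.

```rust
struct LogEntry {
    term: u64,
    command: Vec<u8>,
}

struct NodeState {
    id: u32,
    current_term: u64,
    voted_for: Option<u32>,
    role: u8, // Follower=0, Candidate=1, Leader=2
    commit_index: u64,
    log: Vec<LogEntry>,
}

/// Encode nodes (already in ascending id order) into the canonical dump.
fn canonical_dump(nodes: &[NodeState]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(b"DSERAFT1"); // 8-byte magic
    out.extend_from_slice(&(nodes.len() as u32).to_le_bytes());
    for n in nodes {
        out.extend_from_slice(&n.id.to_le_bytes());
        out.extend_from_slice(&n.current_term.to_le_bytes());
        // None is written as the signed sentinel -1 (i64, little-endian).
        let voted: i64 = n.voted_for.map_or(-1, |id| id as i64);
        out.extend_from_slice(&voted.to_le_bytes());
        out.push(n.role);
        out.extend_from_slice(&n.commit_index.to_le_bytes());
        out.extend_from_slice(&(n.log.len() as u32).to_le_bytes());
        for e in &n.log {
            out.extend_from_slice(&e.term.to_le_bytes());
            out.extend_from_slice(&(e.command.len() as u32).to_le_bytes());
            out.extend_from_slice(&e.command);
        }
    }
    out
}
```

Feeding it three nodes with empty logs produces 12 header bytes plus 3 * 33 node bytes, which matches the 111-byte total computed above.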

How to debug a divergence

If cross_test.sh fails, the script captures the raw dump from each language into /tmp/raft_<label>_<lang>.bin and prints which two languages diverged. Then:

cmp -l /tmp/raft_A_rust.bin /tmp/raft_A_go.bin | head
xxd /tmp/raft_A_rust.bin | sed -n '<line>,+2p'
xxd /tmp/raft_A_go.bin   | sed -n '<line>,+2p'

The first divergence offset tells you what to look at:

offset range	likely culprit
0x00–0x07	magic (typo: `DSERAFT1` not `DESRAFT1`)
0x08–0x0b	node_count (impossible if all three accept `--nodes` correctly)
inside a node block, on `current_term`	election timer or heap-order bug
inside a node block, on `voted_for`	`None` encoding (must be i64 LE `-1`)
inside a node block, on `role`	enum mapping (Follower=0, Candidate=1, Leader=2)
inside a node block, on `commit_index`	`propose()` not calling `advance_commit()`, or quorum count wrong
inside a `log` entry	AppendEntries truncate-on-conflict bug, or peer iteration order

In all six existing scenarios these checks pass; the table above is the runbook for the day someone changes the algorithm and forgets to update one of the three implementations.
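The offset-to-culprit lookup can also be automated. The sketch below is a hypothetical helper (neither `first_divergence` nor `locate` is part of the lab's scripts) that walks the dump layout described earlier and names the field containing a given byte offset.

```rust
/// Read a little-endian u32 at `at`, or None if the dump is too short.
fn read_u32(d: &[u8], at: usize) -> Option<u32> {
    let b = d.get(at..at + 4)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// First 0-based byte offset at which two dumps differ, if any.
fn first_divergence(a: &[u8], b: &[u8]) -> Option<usize> {
    match a.iter().zip(b.iter()).position(|(x, y)| x != y) {
        Some(i) => Some(i),
        None if a.len() != b.len() => Some(a.len().min(b.len())),
        None => None,
    }
}

/// Name the field that contains byte `off`; None if `off` is outside the dump.
fn locate(dump: &[u8], off: usize) -> Option<String> {
    if off >= dump.len() {
        return None;
    }
    if off < 8 {
        return Some("magic".to_string());
    }
    if off < 12 {
        return Some("node_count".to_string());
    }
    // Fixed-width per-node fields in dump order: (name, width in bytes).
    let fixed: [(&str, usize); 6] = [
        ("id", 4),
        ("current_term", 8),
        ("voted_for", 8),
        ("role", 1),
        ("commit_index", 8),
        ("log_len", 4),
    ];
    let node_count = read_u32(dump, 8)?;
    let mut pos = 12usize;
    for _ in 0..node_count {
        let id = read_u32(dump, pos)?;
        let log_len = read_u32(dump, pos + 29)?; // 4 + 8 + 8 + 1 + 8 bytes precede log_len
        for &(name, width) in fixed.iter() {
            if off < pos + width {
                return Some(format!("node {}: {}", id, name));
            }
            pos += width;
        }
        for i in 0..log_len {
            let cmd_len = read_u32(dump, pos + 8)? as usize;
            let parts = [("term", 8), ("cmd_len", 4), ("payload", cmd_len)];
            for &(name, width) in parts.iter() {
                if off < pos + width {
                    return Some(format!("node {}: log[{}].{}", id, i, name));
                }
                pos += width;
            }
        }
    }
    None
}
```

Note that `cmp -l` reports 1-based byte numbers, so subtract 1 before passing one of its offsets to `locate`.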

Tick-level scope (Rust REPL trick)

To watch a scenario from the inside, add this temporary print at the top of the simulator loop body in Cluster::run (it needs the tick variable t in scope):

if std::env::var("RAFT_TRACE").is_ok() {
    eprintln!("t={} leader={:?} terms={:?}", t,
        self.nodes.iter().find(|n| n.role == Role::Leader).map(|n| n.id),
        self.nodes.iter().map(|n| n.current_term).collect::<Vec<_>>());
}

then run RAFT_TRACE=1 raftctl --seed 42 --nodes 3 --rounds 1000 ... 2>&1 | head -50. The trace is written to stderr (hence the 2>&1), so it is not part of the canonical dump and does not affect the sha256 printed on stdout. Remove before commit.

db-17 — Verification

How to reproduce the green status on a clean machine.

Prerequisites

macOS or Linux with Apple Clang / clang ≥ 14 / gcc ≥ 11.
cmake ≥ 3.20.
Rust toolchain ≥ 1.74.
Go ≥ 1.22.
shasum, cmp, awk (default on macOS; coreutils on Linux).

One command

cd db-17-raft
scripts/verify.sh        # builds + unit tests in all three langs
scripts/cross_test.sh    # cross-language sha256 match across six scenarios

verify.sh should print === OK === and cross_test.sh should print === ALL OK ===; both exit 0.

Per-language drill-down

Rust

cd db-17-raft/src/rust
cargo test --release --quiet
cargo build --release

Expected: ~10 tests pass. The raftctl binary lands at target/release/raftctl. The release profile uses LTO.

Go

cd db-17-raft/src/go
go test ./...
go build -o /tmp/raftctl_go ./cmd/raftctl

Expected: ok github.com/10xdev/dse/db17 <duration> and a working binary. Tests include TestSha256HexKnownVectors (validates the SHA-256 wrapper against published vectors) and TestVotedForNegativeEncoding (validates the -1 sentinel byte layout).

C++

cd db-17-raft/src/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
ctest --test-dir build --output-on-failure

Expected: 100% tests passed, 0 tests failed out of 1, and test_db17 prints "db-17 C++ tests: 10 passed". The test source #undefs NDEBUG before including <cassert>, so assert() fires in Release builds too.

What "green" means

A green run guarantees:

Per-language unit tests pass. Each implementation independently exercises splitmix64, election-timer reset, RequestVote granting, AppendEntries truncation, single-node commit, multi-node commit, canonical dump magic + node_count + log_len framing, and the SHA-256 implementation against a known test vector.
All six scenarios produce byte-identical canonical dumps across Rust, Go, and C++. cross_test.sh actually compares the raw dump bytes (cmp -s) before comparing the sha256 hex, so a divergence is caught with an exact byte offset rather than just "the hashes don't match".
The sha256s match the table in docs/observation.md. If you change the algorithm or the dump format, both the dumps and the table must change in the same commit. A mismatch between code and docs is itself a verification failure.

What "green" does NOT guarantee

No production safety. There is no fsync; in-memory state is considered durable by construction.
No coverage of snapshot / membership / pre-vote / lease code. Those features are deferred to db-21 and db-23, or possibly dropped altogether (this lab is a study lab, not a production engine).
No client-facing API. Proposals are injected into the simulator via a fixed schedule; there is no Propose RPC for an external client.
No performance characterization. The lab is sized to run in under a second per scenario; multi-thousand-round runs work but are not the goal.

Invariant assertions in code

The C++ test file in particular makes the invariants concrete:

assertion	invariant
`dump.size() >= 12` and starts with `"DSERAFT1"`	dump-format magic
`read_u32_le(dump, 8) == nodes`	node_count framing
`cluster.run(...)` does not panic for any tested `(seed, nodes, rounds, P)`	no out-of-bounds / no UB
`sha256(empty) == e3b0c44298...`	SHA-256 padding boundary case
`n.commit_index <= n.log.len()` for every node after run	no over-commit
`propose` on a single-node leader yields `commit_index == proposals`	majority-of-one rule

The Rust and Go tests assert the same set in their respective testing idioms.

db-17 — Broader Ideas

Where Raft and the choices in this lab show up in real systems, and what to build on top of the same simulator harness in the rest of the distributed track.

Immediate next labs

db-18 — Paxos. Same heap-and-tick harness, different RPC structure (Prepare / Promise / Accept / Accepted), no fixed leader. Paxos's invariants are notoriously hard to reason about by hand; byte-deterministic replay of a counterexample seed is the difference between "I think it's correct" and "I have evidence". Raft was literally designed as the more understandable alternative — implementing both in this order is the recommended path.
db-19 — ZAB. ZooKeeper's atomic broadcast protocol. Similar leader-based skeleton to Raft, but the recovery phase is more involved (NEWLEADER / NEWEPOCH / SYNC / BROADCAST). The Lamport scalar of db-16 generalizes to the (epoch, counter) pair that ZAB calls a "zxid".
db-20 — Distributed KV. Wrap a quorum-replicated key-value store around a chosen consensus engine. The state machine is the only thing that changes — the consensus log feeds Put(k, v) / Delete(k) commands that get applied in commit_index order.
db-23 — Capstone. Adds snapshots, joint-consensus membership changes, and a multi-Raft "shards across regions" deployment on top of this code.

How this lab's pieces map to real systems

The Raft skeleton implemented here is the same core loop that etcd, Consul, TiKV, CockroachDB, and Vault's integrated storage run, and MongoDB's replication protocol is a close Raft-inspired relative. They each add the extensions deferred from this lab (pre-vote, snapshots, learners, joint consensus), but the core RequestVote / AppendEntries loop is unchanged.
The (delivery_time, sender, seq) heap tie-break is the same trick FoundationDB's simulator uses to drive every commit-proxy/storage-server interaction; TigerBeetle, Antithesis, and Hermitian all reach for it independently.
The "leader picks max-term, min-id" convention surfaces as the split-brain resolution rule in production systems: when a partition heals and you see two leaders, the one with the higher term wins unconditionally (id break is academic — different terms imply different elections).
Encoding voted_for = None as -1 is a common sentinel convention for fixed-width on-disk formats, though not a universal one (etcd's raft library, for example, reserves node id 0 to mean "none"). Some implementations use optional / nullable types in a richer wire format (protobuf has optional), but in any fixed-width binary log a sentinel value is the natural choice.
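The "max-term, min-id" rule from the list above fits in one iterator chain. This is a minimal sketch: `NodeView` and `pick_leader` are illustrative names, not the lab's types, and only the three fields the rule needs are modelled.

```rust
use std::cmp::Reverse;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Role {
    Follower,
    Candidate,
    Leader,
}

struct NodeView {
    id: u32,
    current_term: u64,
    role: Role,
}

/// Among nodes that believe they are leader, pick the highest term and
/// break ties by the lowest id. The result depends only on node contents,
/// never on the order in which `nodes` is iterated.
fn pick_leader(nodes: &[NodeView]) -> Option<u32> {
    nodes
        .iter()
        .filter(|n| n.role == Role::Leader)
        .max_by_key(|n| (n.current_term, Reverse(n.id)))
        .map(|n| n.id)
}
```

Wrapping the id in `Reverse` turns "maximum of the key" into "maximum term, then minimum id" without writing a custom comparator.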

Performance experiments worth running later

Crank --rounds to 100k and watch the size of the binary dump grow. The dump is linear in committed entries; if you ever see super-linear growth something is appending entries that don't get committed (a sign of partition oscillation).
Replace splitmix64 with a per-node ChaCha20 generator (for example rand_chacha::ChaCha20Rng in Rust). The simulator will still be deterministic (RNGs are seeded), but cross-language byte equivalence will break unless you also port the ChaCha core identically. Useful exercise in what exactly portability requires.
Try injecting one heavy proposal vs. many small proposals into a 3-node cluster and measure the cluster dump size vs. the bytes actually committed. The difference is the steady-state replication overhead.
Vary the election timeout. The 150 + jitter(0..150) ticks chosen here keep churn low; halve the timeout and you'll see term numbers climb rapidly under any partition, especially scenario F.

What "production-quality" would require beyond this lab

A real disk-backed persistence layer with fsync semantics and crash recovery. The canonical dump pretends current_term, voted_for, and log are durable on every change; a real Raft must fsync before sending any reply that depends on the new state, or risk violating election safety on a power cut.
Network I/O. The simulator hands typed structs across an in-process heap; production uses gRPC or a custom framing protocol with at-least-once delivery and connection-level back-pressure.
Pre-vote and leader leases. Without them, a partitioned candidate bumps its term repeatedly; on heal the legitimate leader steps down unnecessarily. Easy to add as a wrapper on RequestVote; deferred here because it would obscure the core algorithm.
Snapshots and log compaction. Without them, the log grows forever and a slow follower can't catch up over the wire. The canonical dump tolerates this only because the lab's rounds is bounded.
Membership changes. The fixed nodes count at Cluster::new time is fine for a lab but useless in production. Joint consensus or the safer learner-then-promote protocol are major additions; covered in db-23.
Observability. A real Raft cluster exposes per-node term, commit_index, match_index[*], leader_id, election counts, and message rates as metrics. The canonical dump is a post-mortem view; runtime observability is a separate problem.

db-17 step 01 — Leader election

Goal

A cluster of N followers (N = --nodes), started cold, must elect exactly one leader in a bounded number of ticks, and that leader must remain stable as long as it can deliver heartbeats. The election protocol must be byte-deterministic across Rust, Go, and C++.

Tasks

Persistent state. Each RaftNode carries current_term: u64, voted_for: Option<u32>, and log: Vec<LogEntry>. The dump encodes voted_for=None as the signed integer -1 (i64 LE); Some(id) becomes id as i64.
Election timer. reset_election_timer(t) sets election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150. Heartbeat-due is t + 50.
on_tick(t). Followers and candidates that hit election_deadline start a new election: bump current_term, vote for self, broadcast RequestVote to all peers, transition to Candidate. Leaders that hit heartbeat_due broadcast an empty AppendEntries (heartbeat).
RequestVote handling. Grant a vote iff (a) term == current_term, (b) voted_for is None or equal to the candidate, and (c) the candidate's log is at least as up-to-date as ours (the standard last_log_term/last_log_index lex compare). Grant resets the election timer.
RequestVoteReply handling. A candidate that collects a majority of granted replies in the same term transitions to Leader, initializes next_index[p] = log.len() and match_index[p] = 0 for every peer p, and immediately broadcasts AppendEntries (initial heartbeat).
become_follower(term). Used whenever a node sees term > current_term (in any RPC). Sets current_term = term, clears voted_for, resets the election timer, transitions to Follower.
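The election-timer task above can be sketched as follows. The splitmix64 body is the standard one-shot mix from Vigna's reference code; the constant names and the helper names `election_deadline` and `heartbeat_due` are assumptions made for this illustration, not the lab's identifiers.

```rust
/// One step of splitmix64: add the golden-ratio increment, then mix.
fn splitmix64(x: u64) -> u64 {
    let mut z = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

// Hypothetical constant names; the values come from the task list.
const ELECTION_TIMEOUT_MIN: u64 = 150;
const ELECTION_JITTER: u64 = 150;
const HEARTBEAT_INTERVAL: u64 = 50;

/// Deadline in [t + 150, t + 300), a pure function of (seed, node_id, t).
fn election_deadline(seed: u64, node_id: u32, t: u64) -> u64 {
    t + ELECTION_TIMEOUT_MIN + splitmix64(seed ^ u64::from(node_id) ^ t) % ELECTION_JITTER
}

/// Tick at which a leader owes its next heartbeat.
fn heartbeat_due(t: u64) -> u64 {
    t + HEARTBEAT_INTERVAL
}
```

Because the deadline is a pure function of its inputs and uses only wrapping 64-bit arithmetic, the same three lines port to Go and C++ without any platform-dependent behavior.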

Acceptance

Inline unit tests in each language:

splitmix64_known_vectors — splitmix64(0) == 0xE220A8397B1DCDAF (the value Vigna's reference C produces).
election_timer_in_range — 1000 consecutive resets all land in [t+150, t+300).
request_vote_grants_first_only — vote for candidate A, then a RequestVote from B in the same term is denied.
become_leader_from_majority — 3-node cluster, two RequestVoteReply with granted=true transitions the candidate to Leader.
term_bump_demotes_leader — a Leader receiving any RPC with term > current_term becomes Follower and clears voted_for.

All five green in Rust, Go, and C++.
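The three-part vote-granting rule from the task list can be expressed as a single predicate. This is a sketch with hypothetical names (`Voter`, `should_grant`); it assumes any higher-term request has already been handled by become_follower, and it compares log positions as (last term, log length) pairs.

```rust
struct Voter {
    current_term: u64,
    voted_for: Option<u32>,
    last_log_term: u64, // term of the last entry, 0 for an empty log
    log_len: u64,
}

/// True iff the voter should grant its vote to candidate `cand_id`.
fn should_grant(v: &Voter, cand_id: u32, cand_term: u64, cand_last_term: u64, cand_log_len: u64) -> bool {
    // (a) the request is for the term we are currently in
    let same_term = cand_term == v.current_term;
    // (b) we have not voted yet, or we already voted for this candidate
    let free_to_vote = v.voted_for.map_or(true, |id| id == cand_id);
    // (c) lexicographic compare: a later last term wins, then the longer log
    let up_to_date = (cand_last_term, cand_log_len) >= (v.last_log_term, v.log_len);
    same_term && free_to_vote && up_to_date
}
```

Rust tuples already compare lexicographically, so condition (c) needs no hand-written branching.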

Discussion prompts

Why is voted_for persistent (in the canonical dump) but commit_index volatile (also dumped, but only because the dump is a debug oracle, not a recovery file)?
What goes wrong if you reset the election timer on send of RequestVote instead of on grant of someone else's vote? (Hint: split-vote loops.)
Why must "majority" be computed against the configured cluster size (nodes), not against the number of nodes that have replied?

db-17 step 02 — Log replication

Goal

A leader accepts client proposals, replicates them to followers via AppendEntries, and advances commit_index once a majority's match_index covers the entry and the entry is from the leader's current term. Followers truncate any conflicting suffix and append the leader's entries. The result must be byte-deterministic across all three languages.

Tasks

LogEntry. { term: u64, command: Vec<u8> }. Logs are 0-indexed in this implementation, whereas Figure 2 of Ongaro's Raft paper uses 1-based indices; adjust mentally when reading the paper.
propose(cmd). Leader-only:
- push LogEntry { term: current_term, command: cmd } onto own log,
- set match_index[self] = log.len(),
- broadcast AppendEntries to all peers,
- call advance_commit() (so n=1 commits immediately).
broadcast_append_entries(). For each peer in ascending id order, send AppendEntries { term, leader, prev_idx, prev_term, entries: log[next_index[p]..], leader_commit }. prev_idx = next_index[p], prev_term = log[prev_idx-1].term (or 0 if prev_idx == 0).
AppendEntries handling on follower.
- if term > current_term: become_follower(term);
- if term < current_term: reply success=false;
- reset election timer (we heard from a leader);
- if prev_idx > 0 && (log too short || log[prev_idx-1].term != prev_term): reply success=false, match_index=0;
- else: walk each incoming entry; truncate own log at the first (index, term) conflict; append remaining entries; advance commit_index = min(leader_commit, log.len()); reply success=true, match_index=prev_idx+entries.len().
AppendEntriesReply handling on leader.
- if term > current_term: become_follower(term);
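- if success: next_index[from] = reply.match_index; match_index[from] = reply.match_index; advance_commit(); (match_index counts replicated entries, so in a 0-indexed log the next entry to send sits at position match_index);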
- if !success and term == current_term: decrement next_index[from] (clamped at 0); next heartbeat / propose will retry with an earlier prev_idx.
advance_commit(). For N from log.len() down to commit_index + 1:
- if log[N-1].term != current_term: continue (Figure 8 safety);
- if the number of nodes holding entry N, counting the leader exactly once (1 + count(p ≠ self : match_index[p] >= N)), exceeds nodes / 2: set commit_index = N and break.
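The advance_commit() rule can be sketched in isolation. The `Leader` struct below is a hypothetical reduction that keeps only what the rule reads; it assumes `peer_match` holds match_index for every node other than the leader, so the leader itself is counted by the explicit 1.

```rust
struct Leader {
    current_term: u64,
    commit_index: u64,
    log_terms: Vec<u64>,  // term of each log entry, 0-indexed
    peer_match: Vec<u64>, // match_index of every node except the leader
}

impl Leader {
    fn advance_commit(&mut self) {
        let nodes = self.peer_match.len() as u64 + 1;
        let mut n = self.log_terms.len() as u64;
        // Scan from the newest entry down to the first uncommitted one.
        while n > self.commit_index {
            // Only entries from the leader's own term may be committed by counting.
            if self.log_terms[(n - 1) as usize] == self.current_term {
                let holders = 1 + self.peer_match.iter().filter(|&&m| m >= n).count() as u64;
                if holders > nodes / 2 {
                    self.commit_index = n;
                    break;
                }
            }
            n -= 1;
        }
    }
}
```

Committing entry N implicitly commits everything before it, which is how an old-term entry eventually becomes committed once a current-term entry above it reaches a majority.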

Acceptance

Inline unit tests in each language:

propose_single_node_commits — --nodes 1, propose 3 entries, every entry's term is the leader term, commit_index == 3.
append_entries_rejects_term_mismatch — a follower with an empty log receives AE with prev_idx=5, prev_term=1 and returns success=false.
append_entries_truncates_conflict — follower with log=[(t=1), (t=1), (t=2)] receives AE with prev_idx=2, prev_term=1, entries=[(t=3)]; resulting log is [(t=1), (t=1), (t=3)].
commit_requires_current_term — leader at term=5 replicates an old term=3 entry to all followers; commit_index does NOT advance past it until the leader appends a term=5 entry that also reaches majority.
quorum_commit_three_nodes — 3-node cluster, leader proposes one entry, one follower acks; commit_index advances (2 of 3 is a majority including the leader).

All five green in Rust, Go, and C++.
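The follower-side consistency check and truncate-on-conflict walk can be sketched as a free function. This is a simplified illustration: the name `append_entries` and its tuple return value are assumptions, and term handling, timer reset, and commit_index updates are deliberately left out.

```rust
#[derive(Clone, Debug, PartialEq, Eq)]
struct LogEntry {
    term: u64,
    command: Vec<u8>,
}

/// Apply the log part of AppendEntries. Returns (success, match_index).
/// `prev_idx` counts the entries that precede `entries` in the leader's log.
fn append_entries(
    log: &mut Vec<LogEntry>,
    prev_idx: usize,
    prev_term: u64,
    entries: &[LogEntry],
) -> (bool, usize) {
    // Consistency check: we must hold the entry just before the new ones.
    if prev_idx > 0 && (log.len() < prev_idx || log[prev_idx - 1].term != prev_term) {
        return (false, 0);
    }
    for (i, e) in entries.iter().enumerate() {
        let at = prev_idx + i;
        if at < log.len() {
            if log[at].term == e.term {
                continue; // already have this entry, keep it
            }
            log.truncate(at); // first conflict: drop it and everything after
        }
        log.push(e.clone());
    }
    (true, prev_idx + entries.len())
}
```

Skipping entries that already match, rather than truncating unconditionally, keeps a delayed or duplicated AppendEntries from erasing newer entries the follower has since received.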

Discussion prompts

The Figure 8 commit restriction ("commit only entries of the current term") is famously subtle. Construct a 3-node scenario where omitting it lets a leader commit an entry that a future leader's election can erase.
Why does the leader update match_index[self] after propose? (Otherwise the majority check would always exclude the leader.)
What happens if two leaders coexist briefly (network partition that has not yet healed)? Specifically: which leader can advance commit_index, and why is this safe?

db-17 step 03 — Cross-test and partition

Goal

A Cluster that drives an n-node simulation forward by integer ticks, a --partition CLI flag that drops messages in named directions, and a cross-language scripts/cross_test.sh proving the canonical dump's sha256 is byte-identical across Rust, Go, and C++ for six seeded scenarios including partitions.

Tasks

Cluster::new(seed, nodes). Holds:
- nodes: Vec<RaftNode> (ids 0..nodes);
- drop: BTreeSet<(u32, u32)> (directional message-drop set);
- heap: BinaryHeap<InFlight> ordered by (delivery_time, sender, seq) — InFlight implements Ord such that BinaryHeap behaves as a min-heap;
- seq: u64 (global monotonic);
- pending_proposals: VecDeque<Vec<u8>>.
Cluster::run(rounds, n_proposals). For each tick t in 0..rounds:
1. Enqueue scheduled proposals. schedule[i] = (i+1) * rounds / (n_proposals + 1); if t == schedule[i], push payload "p<i:02>" onto pending_proposals.
2. Inject pending into current leader. Find leader as the (max current_term, min id) node with role == Leader; while pending_proposals is non-empty and a leader exists, drain one payload and call leader.propose(payload). The propose pushes RPCs onto the heap with delivery times computed from splitmix64(seed ^ src ^ dst ^ t) % 3 + 1.
3. Deliver. Pop every InFlight whose delivery_time == t. For each, if (sender, dest) is in drop, discard. Otherwise call nodes[dest].handle(rpc, t) and enqueue any reply RPCs the handler produces.
4. Tick. Iterate nodes in ascending id; call node.on_tick(t) on each; enqueue any RPCs produced.
canonical_dump(&cluster) -> Vec<u8>. As specified in CONCEPTS.md: magic "DSERAFT1" (8 bytes), u32_le(node_count), then for each node in id order: id, current_term, voted_for (i64 LE, -1 for None), role (u8), commit_index, log_len, and each entry's (term, cmd_len, cmd_bytes).
raftctl CLI. Parses --seed, --nodes, --rounds, --proposals, --partition s,d,s,d,.... Calls Cluster::new, inserts every (s, d) pair into cluster.drop, runs, dumps, sha256s, prints lowercase hex with no trailing newline.
scripts/cross_test.sh. For each of the six scenarios (A–F in docs/observation.md), invoke all three binaries with the same args, compare raw dumps with cmp -s, then compare hex hashes. Print the scenario label and OK on success, or the diverging offset and the three hashes on failure. End with === ALL OK ===.
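The min-heap ordering on (delivery_time, sender, seq) mentioned in the Cluster::new task is worth seeing concretely, because std's BinaryHeap is a max-heap. The sketch below reverses the comparison inside Ord; the `dest` field and the `drain_due` helper are illustrative additions, and the real InFlight also carries the RPC payload.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

#[derive(Debug, PartialEq, Eq)]
struct InFlight {
    delivery_time: u64,
    sender: u32,
    seq: u64, // globally unique, so the key below never ties
    dest: u32,
}

impl Ord for InFlight {
    fn cmp(&self, other: &Self) -> Ordering {
        // Compare other-to-self so the smallest key is the heap's maximum.
        (other.delivery_time, other.sender, other.seq)
            .cmp(&(self.delivery_time, self.sender, self.seq))
    }
}

impl PartialOrd for InFlight {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

/// Pop every message whose delivery time has arrived, in deterministic order.
fn drain_due(heap: &mut BinaryHeap<InFlight>, t: u64) -> Vec<InFlight> {
    let mut due = Vec::new();
    while heap.peek().map_or(false, |m| m.delivery_time <= t) {
        due.push(heap.pop().expect("peeked element exists"));
    }
    due
}
```

Because `seq` is a global counter, two messages never share a full key, so pop order is fully determined by content and is reproducible in Go's container/heap and C++'s std::priority_queue with the equivalent comparator.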

Acceptance

cargo test --release ⇒ ~10 tests pass.
go test ./... ⇒ ~12 tests pass.
ctest --test-dir build ⇒ 100% tests passed.
./scripts/verify.sh ⇒ === OK ===.
./scripts/cross_test.sh ⇒ all six scenarios OK, final === ALL OK ===.
The exact sha256s match docs/observation.md's table. Specifically scenario A is a2299ff06a2ed5ced5842d100bb7867b3ae50f6e7d7da93f835385565f1ed9e9.

Discussion prompts

The proposal-injection step picks the leader by (max term, min id). Why not "first leader found in iteration order"? (Hint: Go's map iteration is randomized; (max term, min id) is content-defined.)
Scenario E (--partition 0,1,0,2,1,0,2,0) drops every message into or out of node 0. What is the only way the resulting log can contain committed entries? Trace which two-node sub-cluster achieves quorum.
Scenario F is an asymmetric partition (0 → 1 only). Why doesn't this cause permanent leadership churn? (Hint: node 1 can still reach node 0 via AppendEntriesReply.)
If you swap BTreeSet for HashSet in Cluster::drop (Rust), the hashes still match — why? But if you swap BTreeMap for HashMap in RaftNode::next_index, they don't. Articulate the rule.

db-18 — Paxos

This lab implements Multi-Paxos consensus in Rust, Go, and C++, all three producing a byte-identical sha256 of a canonical cluster dump for any (seed, nodes, rounds, proposals, partition) configuration. It is the sibling of db-17 (Raft) and reuses db-16's deterministic simulator discipline: same splitmix64 seeding, same (delivery_time, sender, seq) heap tie-break, same "sorted iteration on the wire" rule, same closed-form proposal schedule.

If db-17 taught you that one consensus algorithm can be expressed identically in three languages, db-18 teaches you that another consensus algorithm — built on different primitives, with no built-in leader concept, and capable of arbitrary concurrent proposers — can be held to the very same bit-level discipline. The two implementations share zero algorithmic code but share all of the determinism machinery, and that is the point.

What is it?

Paxos (Lamport, "The Part-Time Parliament" 1998 / "Paxos Made Simple" 2001) is the original asynchronous consensus algorithm: a family of acceptors collectively decides on a single value per slot despite crashes, message loss, and message reordering. Unlike Raft, Paxos has no first-class leader and no current_term. Its only ordering primitive is the ballot — a lexicographic pair (round, proposer_id) that acceptors monotonically promise to honor.

Single-decree (one-slot) Paxos has two phases:

Phase 1 — Prepare / Promise. A proposer picks a fresh ballot b and broadcasts Prepare(b). An acceptor whose previously promised ballot is ≤ b updates promised := b and replies with every prior accept it holds (each (slot, accepted_ballot, value) triple). On collecting promises from a majority, the proposer enters Phase 2.
Phase 2 — Accept / Accepted. For each slot, the proposer picks the value to propose: if any promise returned a prior accept for that slot, it must re-propose the value with the highest accepted_ballot (Lamport's P2c invariant); otherwise it is free to propose its own client value. It broadcasts Accept(b, slot, v). An acceptor whose promised ballot is ≤ b records accepted[slot] := (b, v) and replies Accepted(b, slot). On collecting accepts from a majority, the proposer declares the slot decided and broadcasts Decided(slot, v) to anyone who didn't get the accept.

Multi-Paxos amortizes Phase 1 across many slots. The proposer who "wins" Phase 1 acts as a distinguished proposer (lab-locally we call this role Leader) and reuses its promised ballot to drive Phase 2 for every subsequent slot, paying the Phase-1 cost only once per ballot. Liveness is preserved by election timeouts: an acceptor that hasn't heard from a leader for ≥ ELECTION_TIMEOUT_MIN + jitter ticks starts its own Phase 1 with a higher round.

This lab implements Multi-Paxos end-to-end. Paxos and its variants underpin Google Chubby, Google Spanner's Paxos groups, and Cassandra's lightweight transactions, and (in spirit) Apache ZooKeeper's ZAB.

Why does it matter?

Paxos is the historical and theoretical root of asynchronous consensus. Raft, ZAB, Viewstamped Replication, and EPaxos are all reactions to or refinements of Paxos. Reading the paper is easier when you have made the algorithm bit-deterministic with your own hands.
No fixed leader means no "single term" to lean on. Raft's safety flows largely from "exactly one leader per term". Paxos has neither. Its safety flows from the much weaker quorum-intersection argument: any two majorities of an n-node cluster share at least one acceptor, and that acceptor's promised-ballot ordering serializes every accept that could possibly decide a slot. Writing the algorithm in three languages, watching the same sha256 fall out, and then deliberately breaking the quorum (scenario E) is the most visceral way to internalise quorum intersection.
Concurrent proposers are first-class. Paxos lets every node attempt Phase 1 at any time. Dueling proposers are not an error case; they are the normal case during leadership churn. The deterministic simulator lets you replay the exact tick at which two proposers tied, see which ballot won, and confirm the safety invariants held without any "leader lease" magic.
Foundation for the rest of the distributed track. db-19 (ZAB) layers epoch+counter on top of a paxos-ish core; db-20 (distributed KV) feeds Paxos accept-decisions into a key-value state machine; db-23 (capstone) introduces snapshots and reconfiguration on top of whichever consensus engine the student picks (Raft, Paxos, or both).
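The quorum-intersection claim above is small enough to check by brute force. The sketch below represents a set of acceptors as a bitmask and tests every pair of sets of at least a given size; `quorum` and `all_pairs_intersect` are names invented for this illustration, and the search is only practical for small n.

```rust
/// Smallest strict majority of an n-node cluster.
fn quorum(n: u32) -> u32 {
    n / 2 + 1
}

/// True iff every two acceptor sets with at least `size` members share a node.
/// Sets are bitmasks over n acceptors; intended for small n only.
fn all_pairs_intersect(n: u32, size: u32) -> bool {
    assert!(n >= 1 && n <= 10, "brute force is only meant for small clusters");
    let sets: Vec<u32> = (0u32..(1u32 << n))
        .filter(|m| m.count_ones() >= size)
        .collect();
    sets.iter().all(|a| sets.iter().all(|b| (a & b) != 0))
}
```

For n = 4, `all_pairs_intersect(4, quorum(4))` holds with quorum size 3, while `all_pairs_intersect(4, 2)` fails because {0, 1} and {2, 3} are disjoint; that gap is exactly why half the cluster is not enough.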

How does it work?

State (per node)

acceptor    : promised_ballot : Ballot                # global, not per-slot
              accepts         : Map<slot, (Ballot, Vec<u8>)>
learner     : learned         : Map<slot, Vec<u8>>
proposer    : role            : Follower | Candidate | Leader
              my_ballot       : Ballot                # the ballot this node is driving
              prepare_promises: Set<acceptor_id>      # accumulated this election
              prepare_accepted: Map<slot, (Ballot, Vec<u8>)>  # recovered during Phase 1
              accept_count    : Map<slot, Set<acceptor_id>>
              next_slot       : u64                   # next fresh slot to propose
              pending         : Deque<Vec<u8>>        # queued client values
timers      : election_deadline   : u64               # sim-time tick
              last_heartbeat_sent : u64

promised_ballot is global per node (covers every slot, present and future) — this is the standard Multi-Paxos optimization. accepts is per-slot, because each slot is its own single-decree instance. learned is the per-slot decision; once set it never changes.

Ballot ordering

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Ballot { round: u32, proposer_id: u32 }

Lex order on (round, proposer_id). Ballot::ZERO = (0, 0) means "no ballot" and compares less than every other ballot. Promotion of promised_ballot is monotonic: once an acceptor has promised b, it will never accept any RPC carrying a strictly lower ballot.
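A short sketch of the ordering, assuming it is obtained by deriving Ord (derived comparisons go field by field in declaration order, so `round` is compared before `proposer_id`). The `next` constructor is a hypothetical helper that mirrors the "max round plus one" rule used when a node starts an election.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Ballot {
    round: u32,       // compared first
    proposer_id: u32, // tie-break between proposers in the same round
}

impl Ballot {
    /// "No ballot": smaller than any ballot a proposer can generate.
    const ZERO: Ballot = Ballot { round: 0, proposer_id: 0 };

    /// A fresh ballot for `proposer_id`, strictly above both inputs.
    fn next(promised: Ballot, mine: Ballot, proposer_id: u32) -> Ballot {
        Ballot {
            round: promised.round.max(mine.round) + 1,
            proposer_id,
        }
    }
}
```

Since `next` always increases the round, the new ballot outranks the promised one even when the new proposer has a smaller id.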

Election timer (liveness)

reset_election_deadline(t):
    election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150

Identical to db-17's election timer. Heartbeats fire every 50 ticks from the current leader to keep follower timers refreshed.

Phase 1 — Prepare / Promise

start_election(t):
    role = Candidate
    new_round = max(promised_ballot.round, my_ballot.round) + 1
    my_ballot = Ballot { round: new_round, proposer_id: self.id }
    prepare_promises = { self.id }                  # self-promise
    prepare_accepted = { slot: (ab, v) | (slot, (ab, v)) in self.accepts }
    if my_ballot >= promised_ballot:
        promised_ballot = my_ballot                 # we promise ourselves too
    broadcast(Prepare { ballot: my_ballot })
    if |prepare_promises| >= quorum():              # n = 1 cluster
        become_leader(t)

on Prepare(b) at acceptor:
    if b >= promised_ballot:
        promised_ballot = b
        if role in {Candidate, Leader} and b > my_ballot:
            step_down(t)                            # higher proposer takes over
        reset_election_deadline(t)
        send Promise(b, accept_ok=true,
                     accepted = sorted_by_slot(accepts),
                     acceptor_id = self.id) → b.proposer_id
    else:
        send Promise(b, accept_ok=false, acceptor_id=self.id) → b.proposer_id

on Promise(b, ok, accepted, from) at candidate:
    if role != Candidate or b != my_ballot: drop
    if not ok: step_down(t); return                 # someone outranks us
    prepare_promises.insert(from)
    for (slot, ab, v) in accepted:
        if slot not in prepare_accepted or ab > prepare_accepted[slot].ballot:
            prepare_accepted[slot] = (ab, v)        # recover highest-ballot value
    if |prepare_promises| >= quorum():
        become_leader(t)

The recovery rule, "take if ab > current.ballot", is the operational form of Lamport's P2c: across any majority of acceptors, the value with the highest accepted ballot for a slot is the only value that could already be decided in that slot, so the new leader must keep proposing it (or anything if no acceptor reports a prior accept).
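The recovery rule can be sketched as a merge over the prior accepts carried by each Promise. The names `Recovered` and `merge_promise` are illustrative, and the Ballot ordering is assumed to come from a derived Ord on (round, proposer_id).

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Ballot {
    round: u32,
    proposer_id: u32,
}

/// slot -> (highest accepted ballot seen so far, its value)
type Recovered = BTreeMap<u64, (Ballot, Vec<u8>)>;

/// Fold one Promise's prior accepts into the candidate's recovery map,
/// keeping for each slot the value with the highest accepted ballot.
fn merge_promise(recovered: &mut Recovered, accepted: &[(u64, Ballot, Vec<u8>)]) {
    for (slot, ab, v) in accepted {
        let replace = match recovered.get(slot) {
            Some((current, _)) => ab > current,
            None => true,
        };
        if replace {
            recovered.insert(*slot, (*ab, v.clone()));
        }
    }
}
```

A BTreeMap keeps the slots sorted, so the later "re-issue Accepts in slot order" step iterates deterministically without an extra sort.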

Phase 2 — Accept / Accepted

become_leader(t):
    role = Leader
    # Re-issue Accepts under our ballot for every recovered slot.
    for slot in sorted(prepare_accepted.keys):
        if slot in learned: continue
        value = prepare_accepted[slot].value
        accepts[slot] = (my_ballot, value)
        accept_count[slot] = { self.id }
        broadcast(Accept { ballot: my_ballot, slot, value })
    next_slot = 1 + max(any seen slot in accepts ∪ learned, or -1)
    last_heartbeat_sent = t
    broadcast(Heartbeat { ballot: my_ballot })
    drain_pending()

drain_pending():
    while pending is non-empty:
        value = pending.pop_front()
        slot = next_slot; next_slot += 1
        accepts[slot] = (my_ballot, value)
        accept_count[slot] = { self.id }
        broadcast(Accept { ballot: my_ballot, slot, value })
        try_decide(slot)                            # n=1 cluster

on Accept(b, slot, v) at acceptor:
    if b >= promised_ballot:
        promised_ballot = b
        accepts[slot] = (b, v)
        if role in {Candidate, Leader} and b > my_ballot:
            step_down(t)
        reset_election_deadline(t)
        send Accepted(b, slot, ok=true, self.id) → b.proposer_id
    else:
        send Accepted(b, slot, ok=false, self.id) → b.proposer_id

on Accepted(b, slot, ok, from) at leader:
    if role != Leader or b != my_ballot: drop
    if not ok: step_down(t); return
    accept_count[slot].insert(from)
    try_decide(slot)

try_decide(slot):
    if role != Leader or slot in learned: return
    if |accept_count[slot]| >= quorum():
        v = accepts[slot].value
        learned[slot] = v
        broadcast(Decided { slot, value: v })

on Decided(slot, v) at any node:
    learned[slot] = v
    reset_election_deadline(t)

on Heartbeat(b) at node:
    if b >= my_ballot and role in {Candidate, Leader} and b.proposer_id != self.id:
        step_down(t)
    if b >= promised_ballot or (promised_ballot != ZERO and b == promised_ballot):
        reset_election_deadline(t)
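
The `accept_count` bookkeeping in the handlers above can be sketched in Rust as follows. This is an illustration only: `LeaderState` and `on_accepted` are hypothetical names, and the struct keeps just the fields needed to show the quorum check. The acknowledgement set is a `BTreeSet`, so a duplicate ack from the same node cannot inflate the count.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Strict majority of n nodes, with the leader counted.
fn quorum(n: u32) -> usize {
    (n / 2 + 1) as usize
}

/// Hypothetical leader-side state; only what the decide path needs.
struct LeaderState {
    n: u32,
    accept_count: BTreeMap<u64, BTreeSet<u32>>, // slot -> ids that acked
    learned: BTreeSet<u64>,                     // slots already decided
}

impl LeaderState {
    /// Record an ok=true Accepted. Returns true only for the ack
    /// that pushes the slot over the quorum threshold.
    fn on_accepted(&mut self, slot: u64, from: u32) -> bool {
        if self.learned.contains(&slot) {
            return false; // decided once, never again
        }
        let acks = self.accept_count.entry(slot).or_default();
        acks.insert(from); // set semantics: a repeated ack is a no-op
        if acks.len() >= quorum(self.n) {
            self.learned.insert(slot);
            return true;
        }
        false
    }
}
```

With `n = 3`, the leader's own ack plus one peer ack decides the slot; a third ack arriving later changes nothing.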

Simulator loop (per tick `t in 0..rounds`)

1. enqueue scheduled proposals  — schedule[i] = (i+1) * rounds / (K+1)
2. drain cluster-pending values into the current leader (if any)
3. pop every in-flight msg with delivery_time <= t and dispatch handle()
4. tick all nodes in ascending id; on_tick may fire election or heartbeat

The leader-pick rule for proposal injection is "lowest-id node with role == Leader". During leadership churn there may be no leader (in which case the value waits in cluster_pending) or even two stale leaders (in which case the lowest id wins). The deterministic choice is what keeps the byte hash stable.
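
Both rules are small enough to sketch directly. The function names below (`schedule_tick`, `pick_leader`) are hypothetical, and the sketch assumes node ids are dense indices into a slice, which the text does not state explicitly.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Role {
    Follower,
    Candidate,
    Leader,
}

/// Tick at which proposal i (0-based) of k is injected.
/// Integer division, so every language truncates the same way.
fn schedule_tick(i: u64, rounds: u64, k: u64) -> u64 {
    (i + 1) * rounds / (k + 1)
}

/// Lowest-id node currently in role Leader, if any.
/// `roles[id]` is assumed to hold the role of node `id`.
fn pick_leader(roles: &[Role]) -> Option<usize> {
    roles.iter().position(|r| *r == Role::Leader)
}
```

For `--rounds 1000 --proposals 5` the schedule works out to ticks 166, 333, 500, 666 and 833. When `pick_leader` returns `None`, the value would stay in `cluster_pending` until a later tick.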

Wire format (Rpc)

Six variants, modeled as a tagged union in Go, an enum in Rust, and std::variant-backed types in C++. All integer fields are fixed-width and little-endian:

Prepare    { ballot: (round: u32, proposer_id: u32) }
Promise    { ballot, accept_ok: bool, acceptor_id: u32,
             accepted: [(slot: u64, accepted_ballot, value: Vec<u8>)] }
Accept     { ballot, slot: u64, value: Vec<u8> }
Accepted   { ballot, slot: u64, accept_ok: bool, acceptor_id: u32 }
Decided    { slot: u64, value: Vec<u8> }
Heartbeat  { ballot }

The wire format is not serialized to disk by this lab — the simulator passes Rpcs as typed structs in memory. The only thing that is serialized is the canonical dump, and that is what gets hashed.
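
One possible Rust rendering of the six variants, offered as a sketch rather than the lab's actual definition: the variant and field names mirror the listing above, while the `ballot()` accessor is a hypothetical convenience added for illustration.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Ballot {
    round: u32,
    proposer_id: u32,
}

#[derive(Clone, Debug, PartialEq, Eq)]
enum Rpc {
    Prepare { ballot: Ballot },
    Promise {
        ballot: Ballot,
        accept_ok: bool,
        acceptor_id: u32,
        accepted: Vec<(u64, Ballot, Vec<u8>)>, // (slot, accepted_ballot, value)
    },
    Accept { ballot: Ballot, slot: u64, value: Vec<u8> },
    Accepted { ballot: Ballot, slot: u64, accept_ok: bool, acceptor_id: u32 },
    Decided { slot: u64, value: Vec<u8> },
    Heartbeat { ballot: Ballot },
}

impl Rpc {
    /// Hypothetical helper: the ballot carried by a message.
    /// Decided is the only variant without one.
    fn ballot(&self) -> Option<Ballot> {
        match self {
            Rpc::Prepare { ballot }
            | Rpc::Heartbeat { ballot }
            | Rpc::Promise { ballot, .. }
            | Rpc::Accept { ballot, .. }
            | Rpc::Accepted { ballot, .. } => Some(*ballot),
            Rpc::Decided { .. } => None,
        }
    }
}
```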

Canonical dump format

file := magic[8 = "DSEPAX01"] u32_le(node_count) node*

node := u32_le id
        u32_le promised_ballot.round
        u32_le promised_ballot.proposer_id
        u8     role                       # Follower=0, Candidate=1, Leader=2
        u32_le my_ballot.round
        u32_le my_ballot.proposer_id
        u32_le accept_count
        accept * accept_count
        u32_le learned_count
        learned * learned_count

accept  := u64_le slot
           u32_le accepted_ballot.round
           u32_le accepted_ballot.proposer_id
           u32_le value_len
           u8 value[value_len]

learned := u64_le slot
           u32_le value_len
           u8 value[value_len]

Nodes appear in ascending id order; inside each node, both accepts and learned are emitted in ascending slot order. All multi-byte integers are little-endian. The dump is hashed with SHA-256 and the lowercase hex digest is what paxosctl prints to stdout (no trailing newline).
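
The grammar above translates almost mechanically into a writer. The sketch below is an assumption-laden illustration: `NodeDump`, `put_value` and this `canonical_dump` signature are hypothetical, ballots are plain `(round, proposer_id)` tuples, and the SHA-256 step is left out because it needs a hashing implementation outside the Rust standard library.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-node snapshot; ballots are (round, proposer_id).
struct NodeDump {
    id: u32,
    promised: (u32, u32),
    role: u8, // Follower=0, Candidate=1, Leader=2
    my_ballot: (u32, u32),
    accepts: BTreeMap<u64, ((u32, u32), Vec<u8>)>,
    learned: BTreeMap<u64, Vec<u8>>,
}

/// u32_le length prefix followed by the raw value bytes.
fn put_value(out: &mut Vec<u8>, v: &[u8]) {
    out.extend_from_slice(&(v.len() as u32).to_le_bytes());
    out.extend_from_slice(v);
}

fn canonical_dump(nodes: &[NodeDump]) -> Vec<u8> {
    let mut sorted: Vec<&NodeDump> = nodes.iter().collect();
    sorted.sort_by_key(|n| n.id); // nodes in ascending id order
    let mut out = b"DSEPAX01".to_vec();
    out.extend_from_slice(&(sorted.len() as u32).to_le_bytes());
    for n in sorted {
        out.extend_from_slice(&n.id.to_le_bytes());
        out.extend_from_slice(&n.promised.0.to_le_bytes());
        out.extend_from_slice(&n.promised.1.to_le_bytes());
        out.push(n.role);
        out.extend_from_slice(&n.my_ballot.0.to_le_bytes());
        out.extend_from_slice(&n.my_ballot.1.to_le_bytes());
        out.extend_from_slice(&(n.accepts.len() as u32).to_le_bytes());
        // BTreeMap iteration is already in ascending slot order.
        for (slot, (ab, v)) in &n.accepts {
            out.extend_from_slice(&slot.to_le_bytes());
            out.extend_from_slice(&ab.0.to_le_bytes());
            out.extend_from_slice(&ab.1.to_le_bytes());
            put_value(&mut out, v);
        }
        out.extend_from_slice(&(n.learned.len() as u32).to_le_bytes());
        for (slot, v) in &n.learned {
            out.extend_from_slice(&slot.to_le_bytes());
            put_value(&mut out, v);
        }
    }
    out
}
```

Using `BTreeMap` for `accepts` and `learned` makes the ascending-slot requirement a property of the container rather than something each call site has to remember.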

Cross-language invariants

Invariant	Why it matters
splitmix64 constants `0x9E3779B97F4A7C15`, `0xBF58476D1CE4E7B5`, `0x94D049BB133111EB`	identical PRNG output across languages
`election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150`	identical election firing times
`delivery_delay = 1 + splitmix64(seed ^ src ^ dst ^ t) % 3`	identical message scheduling
heap order `(delivery_time, sender, seq)`; `seq` global monotonic	identical delivery sequence
peers iterated in ascending id (`BTreeMap` / `std::map` / explicit `for p:=0;p<n;p++`)	identical broadcast order
acceptor's Promise lists prior accepts in ascending slot order	identical Promise payload bytes
candidate's Phase-1 recovery rule: keep `(ab, v)` with strictly greater `ab`	identical recovered value per slot
`next_slot = 1 + max(seen accept slot ∪ seen learned slot)` after winning Phase 1	identical first fresh slot
`try_decide` quorum check uses ≥ `n/2 + 1` (strict majority, leader counted)	identical decide tick
leader-pick for proposal injection: lowest-id `Leader`	identical client routing
proposal schedule: `schedule[i] = (i+1) * rounds / (K+1)` integer division	identical pending queue contents
`Role` enum order `Follower=0, Candidate=1, Leader=2`	identical dump bytes
dump emits accepts and learned in ascending slot order; nodes in ascending id order	identical dump bytes

Drift in any one of these and scripts/cross_test.sh fails. The companion cmp -l workflow in docs/observation.md walks you from "the hashes differ" to "this exact byte differs" in three commands.
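
The first three rows of the table can be sketched as follows. The mixing function is the standard splitmix64 step with the three constants listed above; treating it as a stateless hash of its argument is an assumption about how the lab applies it, and `ELECTION_JITTER` is a hypothetical constant name.

```rust
/// One splitmix64 step applied to x (standard constants).
fn splitmix64(x: u64) -> u64 {
    let mut z = x.wrapping_add(0x9E3779B97F4A7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E7B5);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
    z ^ (z >> 31)
}

const ELECTION_TIMEOUT_MIN: u64 = 150;
const ELECTION_JITTER: u64 = 150; // hypothetical name for the modulus

/// election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150
fn election_deadline(seed: u64, node_id: u32, t: u64) -> u64 {
    t + ELECTION_TIMEOUT_MIN + splitmix64(seed ^ (node_id as u64) ^ t) % ELECTION_JITTER
}

/// delivery_delay = 1 + splitmix64(seed ^ src ^ dst ^ t) % 3
fn delivery_delay(seed: u64, src: u32, dst: u32, t: u64) -> u64 {
    1 + splitmix64(seed ^ (src as u64) ^ (dst as u64) ^ t) % 3
}
```

The wrapping arithmetic matters for cross-language agreement: Go and C++ unsigned 64-bit arithmetic wraps silently, whereas plain `+` and `*` in Rust panic on overflow in debug builds.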

Multi-Paxos vs. Raft (the comparison the labs exist to make)

Dimension	Raft (db-17)	Multi-Paxos (db-18)
ordering primitive	`current_term: u64` (single integer, persisted, monotonic)	`Ballot { round, proposer_id }` lex pair
leader concept	first-class; at most one leader per term	emergent; "leader" = whoever last won Phase 1
concurrent proposers	forbidden by election safety	allowed (and routine during churn)
consistency check	`prev_log_index / prev_log_term` per AppendEntries	per-slot `accepted_ballot` carried in Promise
Phase-1 cost amortization	none needed (single leader)	Multi-Paxos (one Prepare covers all future slots)
safety from	log matching + election restriction + commit-only-current-term	quorum intersection + Promise reports prior accepts
understandability	designed for clarity (Ongaro 2014)	famously subtle (P2c, dueling proposers)

The lab implementations make these dimensions concrete: scenario A in db-17 takes ~166 ticks to commit a proposal (election + AE round trip); the equivalent scenario A here takes ~150 ticks for Phase 1 plus ~3 ticks per Accept, then the leader runs at Phase-2-only cost until somebody bumps it.

Files

src/rust/ — paxos18 crate + paxosctl binary.
src/go/ — module github.com/10xdev/dse/db18 + cmd/paxosctl.
src/cpp/ — db18_lib static library + paxosctl binary + test_db18.
scripts/verify.sh — builds + runs the unit tests for all three.
scripts/cross_test.sh — proves the three binaries produce byte-identical canonical dumps for six seeded scenarios.

See docs/ for the long-form write-up and steps/ for the staged implementation path.

db-18 — References

Primary sources

Leslie Lamport, The Part-Time Parliament, ACM TOCS 1998. The original Paxos paper. Famously hard to read (the Parliament of Paxos allegory hides the algorithm). The mathematics in §2 is the spec; the rest is narrative. https://lamport.azurewebsites.net/pubs/lamport-paxos.pdf
Leslie Lamport, Paxos Made Simple, ACM SIGACT News 2001. The paper to read first. The whole algorithm (single-decree and the Multi-Paxos extension) is on four pages. The P1 / P1a / P2 / P2a / P2b / P2c conditions in this paper are the ones whose operational forms the simulator enforces. https://lamport.azurewebsites.net/pubs/paxos-simple.pdf
Tushar Chandra, Robert Griesemer, Joshua Redstone, Paxos Made Live — An Engineering Perspective, PODC 2007. Google's Chubby team's writeup of what it took to turn the algorithm into a production system: leader leases, snapshots, group membership, disk corruption, the works. This lab implements roughly §2–§3 of that paper. https://research.google/pubs/paxos-made-live-an-engineering-perspective/
Robbert van Renesse & Deniz Altinbuken, Paxos Made Moderately Complex, ACM CSUR 2015. The most readable end-to-end derivation of Multi-Paxos. Pseudocode in §3 maps almost line-for-line onto this lab's start_election, become_leader, try_decide. https://www.cs.cornell.edu/courses/cs7412/2011sp/paxos.pdf
Heidi Howard, Distributed Consensus Revised, PhD dissertation, Cambridge 2019 (also A Generalised Solution to Distributed Consensus, 2020). Reframes Paxos as one point in a design space parameterised by quorum-intersection requirements; explains why Flexible Paxos works and how Raft, EPaxos, and Vertical Paxos all fit into the same picture. https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-935.pdf

Variants worth knowing

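Leslie Lamport, Fast Paxos, Distributed Computing 2006. Lets clients send values directly to acceptors, saving one message delay on the collision-free happy path, at the cost of larger fast-path quorums (for example, 3f+1 acceptors to tolerate f failures when fast and classic quorums have the same size). EPaxos generalises this.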
Iulian Moraru, David Andersen, Michael Kaminsky, There Is More Consensus in Egalitarian Parliaments (EPaxos), SOSP 2013. Drops the leader entirely; each command picks its own dependency graph. Production-relevant in geo-distributed systems where any-leader latency is uneven.
Lamport, Malkhi, Zhou, Vertical Paxos, PODC 2009. Decouples reconfiguration from the consensus protocol — the answer to "how do you change the acceptor set without stopping the world".
Lamport, Generalized Paxos, MSR-TR-2005-33. Lets commutative commands be ordered concurrently; precursor to EPaxos.

Reference implementations to read alongside

etcd/raft (Go) — included for comparison; etcd uses Raft, but its testing harness (raftpb deterministic replay) is the spirit of this lab's cross-language test. https://github.com/etcd-io/raft
Apache ZooKeeper (Java) — ZAB is a Paxos-family protocol with primary order; useful counterpoint when reading db-19. https://github.com/apache/zookeeper
Apache Cassandra Lightweight Transactions — production Multi-Paxos in the read/write path. Cassandra picks a fresh ballot per LWT, so it pays the Phase-1 cost every time and skips the Multi-Paxos amortization. Worth reading for what not to do if you care about per-decree latency. https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/service/paxos
Google Spanner (paper, not source) — Spanner uses Paxos groups per shard with leader leases plus TrueTime for external consistency. The algorithm core is what you build here; everything else is layered above. https://research.google/pubs/spanner-googles-globally-distributed-database/
TigerBeetle (Zig) — Viewstamped Replication, a near-Paxos cousin. Deterministic simulator that does almost exactly what this lab's cross_test.sh does, but in one language with thousands of seeds. https://github.com/tigerbeetledb/tigerbeetle

Background reading worth doing

Diego Ongaro & John Ousterhout, In Search of an Understandable Consensus Algorithm, USENIX ATC 2014. Read alongside db-17. Section 10 (related work) is the cleanest published comparison of Raft to Paxos. https://raft.github.io/raft.pdf
Junqueira, Reed, Serafini, Zab: High-performance Broadcast for Primary-Backup Systems, DSN 2011. ZAB derivation; see db-19. https://marcoserafini.github.io/papers/zab.pdf
Henry Robinson, Consensus Protocols: Paxos, Cloudera blog 2009. Short, blog-length walkthrough; useful sanity-check after the primary papers. https://blog.cloudera.com/paxos-made-easy-yes-no-maybe/

Cross-lab dependencies

Upstream:
- db-16 distributed-fundamentals (Lamport/VC and the deterministic simulator harness whose discipline this lab inherits wholesale).
- db-17 raft (sibling consensus algorithm; same harness, same canonical-dump discipline, different RPCs and safety arguments).
Downstream:
- db-19 ZAB — leader-based atomic broadcast; the zxid = (epoch, counter) pair generalises this lab's Ballot.
- db-20 Distributed KV — wraps a chosen consensus engine around a key-value state machine. Paxos and Raft are interchangeable plug-ins at that layer.
- db-23 Capstone — adds snapshots, reconfiguration (Vertical Paxos or joint consensus), and multi-shard deployment.

db-18 — Analysis

Required invariants

If any of these is violated, scripts/cross_test.sh will fail, and in the worst case the algorithm itself is unsafe. They are listed in the order in which they are easiest to reason about.

Promise monotonicity (P1). For every acceptor and every sim-time tick t, promised_ballot[t] >= promised_ballot[t-1]. The simulator enforces this with a single comparison in each of the Prepare and Accept handlers: the message's ballot must be >= promised_ballot before any state mutation. The Promise reply's accept_ok bit is the operational form of P1b.
Accept respects promise (P2a). No acceptor ever stores accepts[slot] = (ab, v) with ab < promised_ballot. The Accept handler short-circuits with accept_ok=false when b < promised_ballot; the leader interprets that bit and steps down instead of advancing accept_count.
Per-slot accept uniqueness under a ballot (P2b). For a fixed slot s and a fixed ballot b, the value v that any acceptor stores under (s, b) is the same value. This holds trivially here because only the leader of ballot b ever sends Accept(b, s, v), and its accepts[s] is set once and never overwritten under its own ballot.
P2c (the safety lemma that needs work). Suppose value v is chosen at slot s under ballot b. Then for any ballot b' > b issued by any proposer, the value field of any Accept(b', s, v') will satisfy v' == v. The mechanism: to issue an Accept at all, the proposer must have collected promises from a quorum at ballot b'. That quorum intersects with the quorum that chose v at b. The intersecting acceptor saw v accepted under b, so its Promise carries (s, b, v). The proposer's recovery rule (take the value whose accepted_ballot is highest) therefore takes v (or a later value chosen under some b'' > b, but inductively that value is also v). So v' == v. QED. The simulator implements this rule in start_election's init of prepare_accepted and in the Promise-handler's if ab > prepare_accepted[s].ballot update.
Decided-once / monotonic learn. Once learned[s] is set on any node, it never changes value. Locally enforced by reading before writing; globally guaranteed by P2c.
Byte-determinism of the dump. Two runs with the same (seed, nodes, rounds, proposals, partition) produce identical canonical dump bytes in every language. This requires every iteration order (peers, slots, accepted-list inside Promise, heap pops on identical (time, sender, seq)) to be fixed. Drift here is what cross_test.sh catches.
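
The heap-pop part of the determinism invariant can be illustrated with a small in-flight queue keyed by `(delivery_time, sender, seq)`. This is a sketch under assumptions: `InFlight`, `send` and `pop_due` are hypothetical names, and the message payload is omitted so that only the ordering is visible.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical in-flight queue. BinaryHeap is a max-heap, so keys
/// are wrapped in Reverse to pop the smallest tuple first.
#[derive(Default)]
struct InFlight {
    heap: BinaryHeap<Reverse<(u64, u32, u64)>>, // (delivery_time, sender, seq)
    next_seq: u64,                              // global, monotonic
}

impl InFlight {
    /// Enqueue a message and return the sequence number it was given.
    fn send(&mut self, delivery_time: u64, sender: u32) -> u64 {
        let seq = self.next_seq;
        self.next_seq += 1;
        self.heap.push(Reverse((delivery_time, sender, seq)));
        seq
    }

    /// Pop the next message whose delivery_time is <= t, if any.
    fn pop_due(&mut self, t: u64) -> Option<(u64, u32, u64)> {
        let due = matches!(self.heap.peek(), Some(Reverse(key)) if key.0 <= t);
        if due {
            self.heap.pop().map(|Reverse(key)| key)
        } else {
            None
        }
    }
}
```

Because `seq` is unique, no two keys ever compare equal, so the pop order does not depend on how a particular heap implementation breaks ties.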

Design decisions worth highlighting

promised_ballot is global per node, not per-slot. This is the Multi-Paxos optimization. A per-slot promised-ballot map would be more general (closer to single-decree Paxos per slot) but would cost a Phase-1 per slot. The global ballot lets one Phase 1 cover every present and future slot.
Phase-1 recovery walks every prior accept, not just the latest per slot. The Promise reply contains all of the acceptor's accepts (sorted by slot). The candidate folds them into prepare_accepted using the rule "take if accepted_ballot is strictly greater". Per the proof of P2c this is the only correct rule; a "latest by receive order" tie-break would lose safety the moment Promises arrived out of order.
my_ballot.round is bumped to max(promised, my_ballot).round + 1 when starting an election, not just promised.round + 1. If this node previously won a higher ballot and stepped down due to a partition heal, it would otherwise re-issue its old ballot and immediately lose to its own historical promise. The max makes forward progress under churn deterministic.
Leader-pick rule: lowest-id Leader. When the simulator must inject a client proposal, it picks the lowest-id node currently in role Leader. There may be zero (queue the proposal in cluster_pending) or, briefly, two stale Leaders (the lower id wins; the other's Accept will fail at acceptors that have already promised the new ballot). Determinism > realism here.
drain_pending runs on every Accepted, not just every tick. In single-node mode (--nodes 1) the leader becomes its own quorum and decides slots inside the broadcast loop. Doing the drain in become_leader and in try_decide means scenario D's hash is independent of how the simulator orders ticks.
Heap key (delivery_time, sender, seq). db-16's invariant. Without the seq tiebreak, Promise messages from two acceptors arriving on the same tick from the same sender (impossible by construction, but the type system doesn't know that) would be reorder-able across languages.
Role enum order. Follower=0, Candidate=1, Leader=2 was chosen to match db-17; any change would propagate into the role byte at node-relative offset 12 (file offset 12 + 12 = 24 for the first node), which would silently invalidate scenario A's canonical hash.
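
The ballot-bump rule from the list above is short enough to show in full. The sketch assumes a hypothetical `Ballot` struct with derived lexicographic ordering and a hypothetical `next_ballot` helper; neither name comes from the lab's source.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Ballot {
    round: u32,
    proposer_id: u32,
}

/// Ballot to use when starting an election:
/// round = max(promised, my_ballot).round + 1, proposer_id = self.
fn next_ballot(promised: Ballot, my_ballot: Ballot, self_id: u32) -> Ballot {
    let base = promised.max(my_ballot); // lexicographic max via Ord
    Ballot {
        round: base.round + 1,
        proposer_id: self_id,
    }
}
```

Whichever of the two inputs is larger, the result has a strictly larger round than both, so the new ballot always outranks everything this node has promised or proposed before.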

Tradeoffs worth flagging

Concurrent proposers cost throughput, not safety. Two proposers in dueling Phase 1 can ping-pong each other forever in principle. The lab dodges this in two ways: (a) the deterministic simulator can't sustain a livelock because election timeouts are PRNG-jittered per node-id, and (b) once a leader is elected, the election-timer reset on Heartbeat keeps it elected. Production systems add leader leases (Chubby, Spanner) to push the worst case down further.
No commit-only-current-term subtlety. Raft has Figure 8: a newly-elected leader must commit something in its own term before it can ack older entries, otherwise they can be silently overwritten. Paxos sidesteps the problem because P2c forces a new leader to re-Accept any recoverable value under its own ballot; there is no "shadow commit" to retract. The price is the Phase-1-on-every-startup cost.
No native log compaction. This lab's accepts and learned grow unboundedly. A real Multi-Paxos system snapshots a state machine and discards accepts below the snapshot index (see Spanner, Chubby, db-23). Adding snapshots here would require exposing a committed_through watermark in the dump.
No membership change. n is fixed at Cluster construction time. Vertical Paxos (Lamport/Malkhi/Zhou 2009) is the textbook way to add this. db-23 covers it.
Three languages is more work than two. Two languages prove the spec is unambiguous. Three rules out the case where you and your collaborator have committed the same misreading. C++'s std::map and Rust's BTreeMap agreeing with Go's explicit sort.Slice was the only thing that caught a misordered Promise payload in scenario B during development.

Why three languages

Same answer as db-17: the constraint forces the spec to be a spec and not a habit. Sorted-iteration discipline, fixed enum order, little-endian fixed-width integers, no map iteration on the wire: lapses in any of these are easy to get away with in a single language, and the only way to surface them is to ask "would another implementation make the same choice without being told?". For Paxos the question matters even more: the algorithm is sensitive to whether the highest-ballot prior accept is chosen during recovery, and a sort-order bug would make safety stochastic, which is the worst possible failure mode.

db-18 — Execution

One-shot: prove it works

cd db-18-paxos
bash scripts/verify.sh        # all three languages' unit tests
bash scripts/cross_test.sh    # 6 scenarios × 3 binaries × byte-identical hash

verify.sh must end with === verify OK ===. cross_test.sh must end with === ALL OK === and the six per-scenario hashes must match the table in docs/observation.md.

Per-language workflows

Rust

cd src/rust
cargo build --release         # builds paxos18 lib + paxosctl bin
cargo test --release          # 12 unit tests (see verification.md)
./target/release/paxosctl --seed 42 --nodes 3 --rounds 1000 --proposals 5

Crate layout:

src/lib.rs — paxos18 library: ballot, RPCs, PaxosNode, Cluster, canonical dump, sha256 helper, inline #[cfg(test)] module.
src/bin/paxosctl.rs — CLI entry: parses flags, runs the cluster, emits the sha256 hex digest on a single line with no newline.

Go

cd src/go
go build ./...                # compiles package + cmd/paxosctl (writes no binary)
go build -o paxosctl_bin ./cmd/paxosctl   # produces ./paxosctl_bin used below
go test ./...                 # 11 unit tests
./paxosctl_bin --seed 42 --nodes 3 --rounds 1000 --proposals 5

Module layout:

paxos.go — package db18: same surface as the Rust crate.
paxos_test.go — go test suite.
cmd/paxosctl/main.go — CLI binary.
go.mod — module github.com/10xdev/dse/db18, go 1.22.

C++

cd src/cpp
mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j
./test_db18                   # 11 unit tests
./paxosctl --seed 42 --nodes 3 --rounds 1000 --proposals 5

Source layout:

include/db18/paxos.hpp + src/paxos.cpp — the db18 namespace library.
src/paxosctl_main.cpp — CLI entry.
tests/test_db18.cpp — gtest-style assertions (no framework dependency; pure asserts + main).
CMakeLists.txt — exposes db18_lib, paxosctl, test_db18.

CLI reference

paxosctl has the same flags in all three languages. Anything else on the command line is rejected.

Flag	Type	Default	Meaning
`--seed`	`u64`	required	Seeds splitmix64 for the cluster, every node's election jitter, and every message's delivery delay.
`--nodes`	`u32 (1..=8)`	required	Number of acceptor/proposer nodes; quorum = `nodes/2 + 1`.
`--rounds`	`u64`	required	Number of sim-time ticks to run.
`--proposals`	`u32`	required	Number of client values to inject. Value `i` is `"val-{i}"`, scheduled at tick `(i+1)*rounds/(proposals+1)`.
`--partition`	comma list of `src,dst` pairs (even-length)	(none)	Drop every message with the listed `(src, dst)` ordered pairs. Asymmetric: `0,1` blocks 0→1 but not 1→0. Pass `0,1,1,0` for a symmetric link cut.

Output: a single line of lowercase hex (64 chars), no trailing newline. Exit code 0 on success; non-zero with a stderr message on parse error.
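
As an illustration of the `--partition` semantics, the sketch below parses the comma list into a set of ordered `(src, dst)` pairs and checks whether a message should be dropped. `parse_partition` and `is_dropped` are hypothetical helpers written for this example, not the lab's actual parser, and the error strings are invented.

```rust
use std::collections::BTreeSet;

/// Parse "0,1,1,0" into {(0,1), (1,0)}. An odd number of ids or a
/// non-numeric id is a parse error.
fn parse_partition(arg: &str) -> Result<BTreeSet<(u32, u32)>, String> {
    let ids: Vec<u32> = arg
        .split(',')
        .map(|s| {
            s.trim()
                .parse::<u32>()
                .map_err(|e| format!("bad node id {s:?}: {e}"))
        })
        .collect::<Result<Vec<u32>, String>>()?;
    if ids.len() % 2 != 0 {
        return Err(format!("expected an even number of ids, got {}", ids.len()));
    }
    Ok(ids.chunks(2).map(|p| (p[0], p[1])).collect())
}

/// True when the ordered link src -> dst is cut.
fn is_dropped(blocked: &BTreeSet<(u32, u32)>, src: u32, dst: u32) -> bool {
    blocked.contains(&(src, dst))
}
```

Storing ordered pairs is what makes the cut asymmetric: `0,1` blocks 0→1 only, and the reverse direction needs its own pair.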

Sample invocations

# Single-node "consensus" — leader is itself, every proposal decides instantly.
paxosctl --seed 1 --nodes 1 --rounds 200 --proposals 5

# Three-node happy path.
paxosctl --seed 42 --nodes 3 --rounds 1000 --proposals 5

# Symmetric partition between 0 and 1 plus 0 and 2 — node 0 is isolated.
paxosctl --seed 42 --nodes 3 --rounds 1000 --proposals 3 \
  --partition 0,1,0,2,1,0,2,0

Canonical scenarios

These are the six configurations that cross_test.sh runs. Each combination is a known-stable byte fingerprint; if any of them changes, you have changed semantics and should expect the cross-test to fail until you understand why.

Name	Flags	Notes
A	`--seed 42 --nodes 3 --rounds 1000 --proposals 5`	happy path, 3-node, no partition
B	`--seed 7 --nodes 5 --rounds 2000 --proposals 20`	5-node, longer schedule, more decisions
C	`--seed 99 --nodes 3 --rounds 500 --proposals 0`	leader election only; no proposals
D	`--seed 1 --nodes 1 --rounds 200 --proposals 5`	single-node; quorum = self
E	`--seed 42 --nodes 3 --rounds 1000 --proposals 3 --partition 0,1,0,2,1,0,2,0`	node 0 isolated symmetrically; {1,2} retain quorum
F	`--seed 3 --nodes 5 --rounds 1500 --proposals 10 --partition 0,1`	asymmetric link cut; minor degradation

Sanity checks

If you only have ten seconds:

( cd src/rust && cargo build --release ) >/dev/null && \
( cd src/go && go build -o paxosctl_bin ./cmd/paxosctl ) >/dev/null && \
( cd src/cpp/build && cmake --build . --target paxosctl ) >/dev/null && \
diff <(src/rust/target/release/paxosctl --seed 42 --nodes 3 --rounds 1000 --proposals 5) \
     <(src/go/paxosctl_bin            --seed 42 --nodes 3 --rounds 1000 --proposals 5) && \
diff <(src/rust/target/release/paxosctl --seed 42 --nodes 3 --rounds 1000 --proposals 5) \
     <(src/cpp/build/paxosctl         --seed 42 --nodes 3 --rounds 1000 --proposals 5) && \
echo OK

Silence + OK = green. Any diff = divergence; jump to docs/observation.md § Divergence runbook.

db-18 — Observation

Expected canonical hashes

Six configurations are pinned in scripts/cross_test.sh. The lab is green iff all three binaries (Rust release, Go release, C++ Release) emit exactly these strings on stdout (no trailing newline):

Name	Flags	SHA-256 of canonical dump
A	`--seed 42 --nodes 3 --rounds 1000 --proposals 5`	`0a35fdad1dd97c76a40a61b020c6181a56c4a40d4f723cb68fe70c2112aa9b63`
B	`--seed 7 --nodes 5 --rounds 2000 --proposals 20`	`3cc6cae6cb7f9d2b7cb88088a0f22581ac4c41bd86bab1b3676dd0ba33fd7ead`
C	`--seed 99 --nodes 3 --rounds 500 --proposals 0`	`f28d025af748a790beded6167115c7094a7f939b45d439728e4d6b7e144c3be0`
D	`--seed 1 --nodes 1 --rounds 200 --proposals 5`	`e5e0248c7c4fa20991b90afdac828eab91a7414497461dadc2e1553040693139`
E	`--seed 42 --nodes 3 --rounds 1000 --proposals 3 --partition 0,1,0,2,1,0,2,0`	`674e62d809248ac99401054c195d29b0e2eed6ccc78ec45e96da8aaf69c36096`
F	`--seed 3 --nodes 5 --rounds 1500 --proposals 10 --partition 0,1`	`7d80176abad54e533b2f4174e84f58432a000255fbb2ecbbb1dd915cb6bb6ab5`

These are the contract. Edit any production code such that one of these strings changes and you have changed semantics; reverify end-to-end before you ship.

Walking the wire: scenario D byte-by-byte

Scenario D is the shortest possible dump (one node, five proposals, all decided locally). Use it as a Rosetta Stone before debugging the multi-node hashes. The layout is magic || u32 node_count || node[], and the node payload starts at offset 12.

00..07  4453 4550 4158 3031     "DSEPAX01"            magic
08..0b  01 00 00 00             node_count = 1
0c..0f  00 00 00 00             node.id = 0
10..13  rr rr 00 00             node.promised_ballot.round       (round it won at)
14..17  00 00 00 00             node.promised_ballot.proposer_id (= self.id = 0)
18      02                      node.role = Leader (2)
19..1c  rr rr 00 00             node.my_ballot.round
1d..20  00 00 00 00             node.my_ballot.proposer_id
21..24  05 00 00 00             accept_count = 5
... 5 × {u64 slot, u32 ab.round, u32 ab.proposer_id, u32 value_len, value bytes}
... then u32 learned_count = 5 and 5 × {u64 slot, u32 value_len, value bytes}

Run:

src/rust/target/release/paxosctl --seed 1 --nodes 1 --rounds 200 --proposals 5
# e5e0248c7c4fa20991b90afdac828eab91a7414497461dadc2e1553040693139

To dump the raw bytes and skip the sha256 step, patch the binary locally to print canonical_dump instead of sha256_hex(&canonical_dump). Do not commit that change: the canonical CLI output is the sha256.

Walking the wire: scenario C (no proposals)

Scenario C runs three nodes for 500 ticks with --proposals 0. Exactly one of them will be elected leader; nobody decides anything. The dump therefore has accept_count == 0 and learned_count == 0 for every node. The bytes that do change between languages if you have an iteration-order bug are the per-node promised_ballot.round values (the elected leader's round depends on whether some other proposer almost-elected first). If C is the failing scenario, you have an election-timer determinism bug, not a Phase-2 bug.

Divergence runbook

If cross_test.sh prints MISMATCH scenario X, follow this script:

# 1. Capture the raw bytes from each binary. Patch paxosctl locally
#    to print `canonical_dump` raw instead of sha256 hex, run once,
#    then revert the patch. Save to rust.bin, go.bin, cpp.bin.

cmp -l rust.bin go.bin | head
cmp -l rust.bin cpp.bin | head

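cmp -l prints one line per differing byte: the byte number in decimal (1-based), followed by the two differing byte values in octal (rust.bin first, go.bin second for the first command). Subtract 1 from the byte number to get the 0-based offsets used below, then map the first offset to the field it belongs to: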

Offset	Field	Likely culprit
0..7	magic `"DSEPAX01"`	wrong magic literal
8..11	`node_count`	wrong `u32_le` writer, wrong endianness
12 + k*node_size + 0..3	`node.id`	iterating nodes in wrong order (not ascending id)
12 + k*node_size + 4..11	`promised_ballot`	election-timer drift or wrong PRNG seed mix
12 + k*node_size + 12	`role` (1 byte)	enum reordered (must be Follower=0, Candidate=1, Leader=2)
12 + k*node_size + 13..20	`my_ballot`	step-down logic differs (e.g., resetting `my_ballot` to zero or not)
12 + k*node_size + 21..24	`accept_count`	one acceptor accepted a slot the others did not — Phase-2 message ordering bug
inside an accept tuple	`slot`	accepts iterated in receive order, not sorted by slot
inside an accept tuple	`accepted_ballot`	Phase-1 recovery used a wrong rule (e.g., last-write-wins instead of highest-ballot)
inside an accept tuple	`value_len` / `value`	wrong proposal scheduled at this slot — proposal-injection rule or leader-pick rule differs
inside the learned section	`slot` / `value`	the difference is downstream of an accept-section difference; fix that first
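
For the fixed-width part of the table, the lookup can be automated. The helper below is a hypothetical sketch (`field_at` is not part of the lab): it takes the 1-based byte number that `cmp -l` prints and covers only the file header and the first node's fixed-width fields, because `node_size` varies with the number and length of accepts and learned entries.

```rust
/// Map a 1-based `cmp -l` byte number to a dump field.
/// Covers the file header and node 0's fixed-width prefix only.
fn field_at(cmp_byte_number: usize) -> &'static str {
    let Some(off) = cmp_byte_number.checked_sub(1) else {
        return "invalid: cmp byte numbers start at 1";
    };
    match off {
        0..=7 => "magic",
        8..=11 => "node_count",
        12..=15 => "node[0].id",
        16..=23 => "node[0].promised_ballot",
        24 => "node[0].role",
        25..=32 => "node[0].my_ballot",
        33..=36 => "node[0].accept_count",
        _ => "node[0] accepts/learned, or a later node",
    }
}
```

For offsets past 36 the dump has to be walked entry by entry, since each accept and learned record carries its own `value_len`.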

Tick-level diff

If cmp -l flags a divergence inside the accepts of node 1, add eprintln!/fmt.Fprintln(os.Stderr, ...)/std::cerr lines in each implementation at the boundaries of the suspect ticks:

// after handle() and after on_tick():
eprintln!("t={} id={} promised={:?} role={:?} my={:?} accepts={:?} learned={:?}",
          t, id, n.promised_ballot, n.role, n.my_ballot, n.accepts, n.learned);

Run all three, diff -u rust.log go.log. The first differing tick is the bug.

Most common culprits in practice

Forgetting to sort the Promise payload by slot. Go's map iteration order is randomized; you must sort.Slice before appending to the wire.
Reading next_slot before recovering from prepare_accepted. If recovery doesn't update next_slot = max + 1, the leader will double-allocate a slot that already has a recovered accept, silently overwriting it.
Letting step_down clear promised_ballot. Promises are forever; only my_ballot is candidate-state.
Counting yourself twice in accept_count. Both become_leader and try_decide insert self; the second one is a no-op only if accept_count is a set, not a multiset.
Iterating peers as for p in nodes.iter() on a HashMap. Use BTreeMap in Rust, std::map in C++, and explicit for p := uint32(0); p < n; p++ in Go.
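
The first culprit applies to any unordered container, not only Go maps. As a sketch of the fix in Rust, assuming the accepts happened to be stored in a `HashMap` (the `promise_payload` helper and the tuple layout are hypothetical), the payload is sorted by slot before it is handed to the wire:

```rust
use std::collections::HashMap;

/// Build the Promise `accepted` list in ascending slot order,
/// regardless of the map's internal iteration order.
/// Ballots are (round, proposer_id) tuples.
fn promise_payload(
    accepts: &HashMap<u64, ((u32, u32), Vec<u8>)>,
) -> Vec<(u64, (u32, u32), Vec<u8>)> {
    let mut out: Vec<(u64, (u32, u32), Vec<u8>)> = accepts
        .iter()
        .map(|(slot, (ballot, value))| (*slot, *ballot, value.clone()))
        .collect();
    out.sort_by_key(|(slot, _, _)| *slot); // slots are unique map keys
    out
}
```

Storing accepts in a `BTreeMap` from the start removes the need for the explicit sort, which is why the table of invariants recommends it.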

db-18 — Verification

Prerequisites

Rust ≥ 1.74 with cargo on PATH.
Go ≥ 1.22 (module declares go 1.22).
CMake ≥ 3.20 and a C++17 compiler (Apple clang ≥ 14, gcc ≥ 11).
An external sha256sum tool is not required: each binary computes its own sha256 in-process.

One command

cd db-18-paxos
bash scripts/verify.sh && bash scripts/cross_test.sh

Green is === verify OK === followed by === ALL OK ===. Anything else is a regression.

What `verify.sh` does

Rust — cargo build --release then cargo test --release. Builds paxos18 lib + paxosctl binary; runs the 12 inline tests in src/rust/src/lib.rs. Expected output ends with test result: ok. 12 passed.
Go — go build ./... then go test ./.... Builds cmd/paxosctl + package; runs the 11 tests in src/go/paxos_test.go. Expected output ends with PASS and ok github.com/10xdev/dse/db18.
C++ — cmake -DCMAKE_BUILD_TYPE=Release .., make -j, then ./test_db18. Builds db18_lib, paxosctl, test_db18; the test binary prints one line per assertion-group and ends with ALL 11 TESTS PASSED.

If any of these three blocks fails, the script exits non-zero and the rest does not run.

What `cross_test.sh` does

For each of the six canonical scenarios (A–F), it invokes the three release binaries with identical flags, captures stdout, and asserts rust == go == cpp byte-for-byte. The output prints the matching hash on success; on mismatch it prints all three hashes and exits.

The script does not trust the canonical hashes from this repo to be correct — it only enforces consistency among the three implementations. The "is the hash also the historical fingerprint" check happens by comparing the script's output against docs/observation.md § Expected canonical hashes.

What green guarantees

If both scripts pass:

Safety in the modeled environment. For every seed × scenario in the suite, no acceptor stored a decided value that contradicts another node's decided value for the same slot. The unit tests include cases for dueling proposers, partitions, and Phase-1 recovery; the cross-test sweeps the same scenarios across three independent implementations.
Determinism. Same inputs ⇒ same canonical dump bytes, across languages and across machines (modulo endianness — all targets are little-endian).
Liveness in the modeled environment. Scenarios A, B, D, F all include proposals and run long enough to elect a leader and decide them. Scenarios C and E exist to confirm we don't decide when we shouldn't (C has no proposals; E isolates node 0 so it must not influence the chosen value while {1,2} still carry the load).

What green does not guarantee

Behavior outside the canonical scenarios. The state space of three-process Multi-Paxos is exponential; six fingerprints are an acceptance test, not a model checker. A real Paxos audit needs TLA+ (see references.md § Background reading).
Performance. No latency or throughput is checked. Scenario A takes ~150 ticks of simulated time to decide; that is a function of the configured ELECTION_TIMEOUT_MIN, not a wall-clock SLA.
Snapshotting, membership change, log compaction. None of these exist in this lab; the dump grows unboundedly in accepts and learned. db-23 covers the rest.
Production safety primitives — leader leases, fsync barriers, on-disk checksums, recovery from torn writes, byzantine actors. All deliberately out of scope.

Invariant assertions in code

Each implementation re-checks the lab's invariants where the cost is near-zero. The most load-bearing assertions are listed below; their firing means the test that triggered them is reporting a symptom of a Phase-1 / Phase-2 bug, not a flaky test.

Where	Assertion	What it catches
`Handle::Promise` (all 3 langs)	leader ignores Promise if `b != my_ballot`	stale Promise replies from a previous Phase 1 (would inflate the quorum count and decide too early)
`Handle::Accepted` (all 3 langs)	leader ignores Accepted if `b != my_ballot`	same, for Phase 2
`try_decide` (all 3 langs)	only the current Leader can mark a slot learned	a stepped-down node attempting to declare a decision (would split-brain `learned`)
Promise payload serialization (all 3 langs)	accepts iterated in ascending slot order	undetected map-iteration drift between languages
`canonical_dump` writer (all 3 langs)	nodes in ascending id; per-node `accepts` and `learned` in ascending slot	drift between three independent dump writers
Rust unit `single_node_in_three_node_partition_does_not_decide`	isolated minority must have empty `learned`	a quorum-counting bug that lets a single node decide
Go unit `TestMajorityRequiredToDecide`	1-of-3 cannot decide	same, Go side
C++ unit `cannot_decide_in_minority`	1-of-3 cannot decide	same, C++ side
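
The first two rows of the table amount to a ballot filter in front of the quorum counter. The following is a minimal sketch of that guard, not the lab's code: `PromiseTracker` and its method names are hypothetical, and only the `b != my_ballot` rule and the `n/2 + 1` quorum come from the text.

```rust
use std::collections::BTreeSet;

/// Ballot ordered lexicographically: round first, then proposer_id.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct Ballot {
    pub round: u32,
    pub proposer_id: u32,
}

/// Hypothetical helper: counts Promise replies for exactly one ballot.
pub struct PromiseTracker {
    my_ballot: Ballot,
    quorum: usize,
    voters: BTreeSet<u32>,
}

impl PromiseTracker {
    pub fn new(my_ballot: Ballot, n: usize) -> Self {
        PromiseTracker { my_ballot, quorum: n / 2 + 1, voters: BTreeSet::new() }
    }

    /// Returns true once a quorum of distinct acceptors has promised `my_ballot`.
    /// A reply carrying any other ballot is ignored, so a stale Promise from an
    /// earlier Phase 1 cannot inflate the count. Duplicates are absorbed by the set.
    pub fn on_promise(&mut self, from: u32, b: Ballot, ok: bool) -> bool {
        if ok && b == self.my_ballot {
            self.voters.insert(from);
        }
        self.voters.len() >= self.quorum
    }
}
```

The same shape would apply to Accepted replies in Phase 2, with one tracker per (slot, ballot).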

db-18 — Broader Ideas

The lab implements textbook Multi-Paxos with a deterministic simulator and three-language cross-validation. It deliberately stops where production engineering begins. This document collects the threads worth pulling on next.

Fast Paxos (Lamport 2006)

Skips Phase 2's "leader replays" step on the happy path by letting any proposer broadcast Accept directly. The cost: the fast-path quorum must be ⌈3n/4⌉ instead of ⌊n/2⌋ + 1, so 4-of-5 instead of 3-of-5. When two proposers collide on the fast path the system falls back to classic Paxos. Worth implementing as db-18b once the classic version is fluent — it reuses the entire wire format and only changes the proposer-side state machine.
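
The quorum sizes quoted above are easy to get off by one in integer arithmetic. A minimal sketch, with hypothetical function names that are not part of the lab:

```rust
/// Classic majority quorum: floor(n/2) + 1.
pub fn classic_quorum(n: u32) -> u32 {
    n / 2 + 1
}

/// Fast-path quorum as stated in the text: ceil(3n/4), computed without floats.
pub fn fast_quorum(n: u32) -> u32 {
    (3 * n + 3) / 4
}
```

For n = 5 this gives 3 for the classic path and 4 for the fast path, matching the 3-of-5 and 4-of-5 figures.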

EPaxos (Moraru, Andersen, Kaminsky, SOSP 2013)

Drops the leader entirely. Each command picks its own dependency graph among recently-issued commands and decides in one RTT if no conflict, two RTTs otherwise. The "deterministic simulator + three implementations" discipline you build here is what makes EPaxos's notoriously subtle conflict-detection logic testable at all.

Generalized Paxos (Lamport, MSR-TR-2005-33)

Allows commutative commands to be partially ordered concurrently, not totally ordered serially. The state-machine layer must explicitly declare command commutativity. Precursor to EPaxos. Operationally similar to CRDTs at the storage layer (db-21) but with hard consensus underneath.

Vertical Paxos (Lamport, Malkhi, Zhou, PODC 2009)

Separates the "agree on the value at slot S" problem from the "agree on the membership of the acceptor set at slot S" problem, by delegating reconfiguration to an auxiliary master. Cleaner than joint-consensus (Raft's approach) and Lamport's preferred way to do membership changes. db-23 will revisit.

Flexible Paxos (Howard 2016, dissertation 2019)

Observation: the two quorums in Paxos don't have to be majorities. They only have to intersect: every Phase-1 quorum must share at least one acceptor with every Phase-2 quorum. With size-based quorums, that means the Phase-1 and Phase-2 quorum sizes must sum to more than n. Production payoff: you can run with a smaller Phase-2 quorum (lower latency on the common path) in exchange for a larger Phase-1 quorum (higher cost during leadership churn). A great teaching variant to layer on top of this lab once the canonical hashes are stable.
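
The intersection rule can be checked by brute force for small clusters. The sketch below is illustrative only (all names are hypothetical): it compares the size rule against an exhaustive enumeration of acceptor subsets encoded as bitmasks, and assumes n is well below 32.

```rust
/// Size rule for simple counting quorums: q1 + q2 > n.
pub fn sizes_intersect(n: u32, q1: u32, q2: u32) -> bool {
    q1 + q2 > n
}

/// All k-element subsets of {0..n}, each encoded as a bitmask.
fn subsets(n: u32, k: u32) -> impl Iterator<Item = u32> {
    (0u32..(1u32 << n)).filter(move |m| m.count_ones() == k)
}

/// Exhaustive check: every Phase-1 quorum meets every Phase-2 quorum.
pub fn all_pairs_intersect(n: u32, q1: u32, q2: u32) -> bool {
    subsets(n, q1).all(|a| subsets(n, q2).all(|b| (a & b) != 0))
}
```

On a 5-node cluster, a Phase-1 quorum of 4 with a Phase-2 quorum of 2 passes both checks, while 3 and 2 fails both.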

Production systems to study

Google Chubby

Five-replica Paxos lock service powering Google's lookup infrastructure (DNS, leader election for other services). Chandra et al.'s Paxos Made Live (PODC 2007) is the canonical writeup of what it took to turn the algorithm into a system: leader leases, snapshots every few minutes, master-side group membership, three generations of disk-corruption handling. Read alongside this lab once green.

Google Spanner

Multi-Paxos per shard. Spanner's contribution above Multi-Paxos is TrueTime — a clock API with bounded uncertainty that lets the system serve external-consistency-preserving reads without a Paxos round. The Paxos layer itself is exactly the algorithm you've implemented, plus production hardening.

Apache Cassandra LWT

Lightweight Transactions use single-decree Paxos, not Multi-Paxos, to give linearizable CAS-style updates on top of Cassandra's eventually-consistent replication. Cassandra picks a fresh ballot per request, so it pays the Phase-1 cost every time and never amortizes it: a clean illustration of the Multi-Paxos tradeoff in reverse.

Microsoft Azure Service Fabric

Uses a Paxos variant (Smart Actors) under the hood for ring-leader election and replicated state services. Less publicly documented; the architectural papers are paywalled behind ASE/SOSP, but worth chasing for an industrial counterpoint.

Apache ZooKeeper (ZAB)

Not strict Paxos but in the same family. ZAB layers epoch+counter on top of a primary-order protocol; the zxid pair is the direct analogue of this lab's Ballot. db-19 builds it.

Performance experiments worth running

The deterministic simulator is too clean for real performance work, but the simulator's ticks are a fine unit of cost for comparative experiments:

Phase-1 amortization sweep. For nodes ∈ {3,5,7,9}, run proposals = 50 and count the number of ticks to decide the last slot. The expected curve is linear in nodes for the first decision (Phase 1 costs a broadcast round-trip per acceptor) and constant per slot thereafter (Phase 2 RTT).
Election-timer jitter sensitivity. Vary ELECTION_TIMEOUT_SPAN and measure how often dueling proposers ping-pong before someone wins. The textbook answer is "wider jitter = fewer collisions = fewer ballot bumps", and the simulator lets you confirm it without networking.
Quorum recomputation latency. For Flexible Paxos configurations, plot Phase-2 latency against Phase-1 quorum size. Howard 2016 has the analytical curve; you can ground-truth it.
Comparison to Raft (db-17). Same flags, same scenarios, same measurement. The lab structure is identical on purpose.

What "production-quality" would require beyond this lab

Disk durability. A real acceptor fsyncs promised_ballot, accepts, and (depending on design) learned before replying to a Promise / Accepted. Without that, a crash-restart cycle can silently retract a promise and break safety.
Snapshotting. accepts and learned grow forever in this lab. A real system periodically snapshots the state machine and garbage-collects the accepts and learned entries below the snapshot index. The snapshot itself must be agreed on by Paxos (or by a separate snapshot coordinator), which is a lab of its own.
Membership reconfiguration. Adding/removing acceptors safely is non-trivial: you must either run two configurations in parallel during the transition (Raft's joint consensus) or delegate the membership decision to a higher layer (Vertical Paxos). db-23 picks this up.
Leader leases. Production Paxos systems give the current leader a time-bounded lease to serve reads without consulting acceptors. This requires a synchronized clock model (Spanner's TrueTime, or weaker lease-renewal protocols) — orthogonal to consensus per se but tightly coupled in real deployments.
Witness / arbiter nodes. Some deployments allow a third node to hold no data but break tie-vote symmetry. Implementing this while keeping safety proofs sound requires care.
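Recovery from disk corruption. Real-world failure modes include silent bit-rot of promised_ballot. The defensive posture is to checksum every persisted record and to treat a checksum failure as "I may have promised any ballot", not as "I've never voted for anything": an acceptor that forgets a promise can accept a lower ballot and break safety. In practice the node stops voting until it has caught up from its peers (Paxos Made Live describes rejoining as a non-voting member), which trades liveness during recovery for safety.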
Observability. Live systems need per-slot decision latency histograms, per-acceptor promise rejection counters, leader flap detection. The canonical dump is the right shape of observability but the real one runs continuously rather than on-demand.

db-18 step 01 — Single-decree Paxos

Goal

Build the two-phase Paxos protocol for one slot. A proposer must be able to drive a value to a decision in the presence of competing proposers, and an acceptor's recorded state must be exactly enough for the next proposer to recover any value that might have already been chosen. The byte layout of acceptor state must be identical across Rust, Go, and C++.

Tasks

Ballot. Define Ballot { round: u32, proposer_id: u32 } with lexicographic ordering (round first, then proposer_id as tie-break). Provide a Ballot::ZERO constant equal to (0, 0). Every comparison in the rest of the protocol uses this order.
PaxosNode skeleton. Each node carries:
- id: u32, n: u32 (cluster size), quorum = n/2 + 1.
- role: Role (Follower / Candidate / Leader).
- promised_ballot: Ballot — highest ballot ever promised (one value, shared across all slots in this Multi-Paxos style).
- my_ballot: Ballot — this proposer's current attempt.
- accepts: BTreeMap<Slot, (Ballot, Vec<u8>)> — for each slot, the highest-ballot accept observed.
- learned: BTreeMap<Slot, Vec<u8>> — decided values, in slot order.
on_prepare(ballot). If ballot >= promised_ballot, set promised_ballot = ballot and reply Promise { accept_ok = true, accepted = [(slot, ab, value) for every entry in accepts] }. Otherwise reply Promise { accept_ok = false, accepted = [] }. The full walk over accepts is what makes Phase 1 the recovery step.
on_promise. A proposer collects promises until it has a quorum. For each slot mentioned in any promise, it adopts the value of the highest-ballot accept (Paxos safety property P2c). For slots with no prior accept, the proposer is free to use its own pending value. The proposer then transitions to Leader and broadcasts Accept for every slot in its working set.
on_accept(ballot, slot, value). If ballot >= promised_ballot, update promised_ballot = ballot, overwrite accepts[slot] = (ballot, value), reply Accepted { accept_ok = true }. Otherwise reply Accepted { accept_ok = false }. Note that an accepted value is never unaccepted — only superseded by a higher-ballot accept on the same slot.
on_accepted. A proposer that collects a quorum of accept_ok = true for the same (slot, ballot) learns the value and broadcasts Decided { slot, value }. Learners (every node) record learned[slot] = value on Decided.
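
The acceptor side of the tasks above can be sketched in a few lines. This is a minimal illustration under assumptions, not the lab's reference code: the `Acceptor` type and the return shapes (`Option` for Promise, `bool` for Accepted) are hypothetical stand-ins for the real RPC structs, while the `>=` comparisons and the full walk over `accepts` follow the task descriptions.

```rust
use std::collections::BTreeMap;

/// Derived ordering compares fields in declaration order: round, then proposer_id.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct Ballot {
    pub round: u32,
    pub proposer_id: u32,
}

impl Ballot {
    pub const ZERO: Ballot = Ballot { round: 0, proposer_id: 0 };
}

pub type Slot = u64;

/// Hypothetical acceptor-only slice of PaxosNode.
pub struct Acceptor {
    pub promised_ballot: Ballot,
    pub accepts: BTreeMap<Slot, (Ballot, Vec<u8>)>,
}

impl Acceptor {
    pub fn new() -> Self {
        Acceptor { promised_ballot: Ballot::ZERO, accepts: BTreeMap::new() }
    }

    /// Phase 1. Some(prior accepts, ascending slot) is a promise; None is a nack.
    pub fn on_prepare(&mut self, ballot: Ballot) -> Option<Vec<(Slot, Ballot, Vec<u8>)>> {
        if ballot < self.promised_ballot {
            return None;
        }
        self.promised_ballot = ballot;
        // BTreeMap iterates in key order, so the reply is already sorted by slot.
        Some(self.accepts.iter().map(|(s, (b, v))| (*s, *b, v.clone())).collect())
    }

    /// Phase 2. true means accepted; false means the ballot was stale.
    pub fn on_accept(&mut self, ballot: Ballot, slot: Slot, value: Vec<u8>) -> bool {
        if ballot < self.promised_ballot {
            return false;
        }
        self.promised_ballot = ballot;
        self.accepts.insert(slot, (ballot, value));
        true
    }
}
```

Deriving `Ord` on the struct gives the lexicographic comparator for free, provided the fields stay declared in (round, proposer_id) order.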

Acceptance

Inline unit tests in each language. Names below are the Rust form; Go uses TestSha256KnownVectors style, C++ uses test_sha256_known_vectors:

sha256_known_vectors — empty, "abc", and the lazy-dog vector all hash to the well-known constants. Locks the SHA-256 implementation to RFC 6234.
dueling_proposers_higher_ballot_wins — acceptor promises (1,1), then (1,2) arrives and is promised; a stale Accept at (1,1) is nacked. Verifies promised_ballot monotonicity.
promise_carries_prior_accept_for_recovery — acceptor with a prior accept at ballot b1 on slot 0 receives a Prepare at higher ballot b2; the Promise must include the (0, b1, value) tuple so the new leader can re-propose the value. This is P2c.
majority_required_to_decide — proposer in a 5-node cluster with only 2 of 5 accepts must not call the slot decided; the third accept tips it over the threshold.
ballot_ordering_is_lexicographic — (1,9) < (2,0), (1,0) < (1,1), ZERO < (0,1). Locks the comparator.

All five green in Rust, Go, and C++.

Discussion prompts

Quorum intersection. Why must any two quorums share at least one acceptor? Walk through what breaks if a 4-node cluster used quorum size 2 instead of 3.
Why P2c. Suppose Phase 1 returned just accept_ok without the list of prior accepts. Construct a 3-node run where a value v is chosen, then a higher-ballot proposer chooses a different value w. Why does carrying prior accepts forward in the Promise prevent this?
Ballots vs terms. Raft's term is a single u64. Paxos's ballot is (round, proposer_id). What does the proposer_id tie-break buy you that a single counter would not, and why does Raft not need it?

db-18 step 02 — Multi-Paxos and the replicated log

Goal

Generalise single-decree Paxos into a log. A stable leader runs Phase 1 once, then drives a sequence of slots through Phase 2 only — that is the entire point of Multi-Paxos. Newly elected leaders must recover any partially-accepted slots before issuing new proposals, so the log stays contiguous and every committed prefix is identical on every replica.

Tasks

Leader election trigger. A Follower or Candidate whose election_deadline elapses bumps my_ballot.round = max(my_ballot.round, promised_ballot.round) + 1, sets my_ballot.proposer_id = self.id, transitions to Candidate, and broadcasts Prepare { ballot: my_ballot }. Election deadline is reset with the same splitmix64 jitter formula as Raft: t + 150 + splitmix64(seed ^ id ^ t) % 150.
become_leader. On collecting quorum promises for my_ballot, transition to Leader, then:
- Compute next_slot = max(slot in any promise.accepted) + 1, defaulting to max(learned.keys()) + 1 if no accepts were reported.
- For every slot in [0, next_slot) that appears in any promise: adopt the value of the highest-ballot accept and broadcast Accept { my_ballot, slot, value } (this is the recovery sweep — it re-proposes potentially-chosen values under the new ballot).
- Call drain_pending to attach pending client values to the next free slots, broadcasting Accept for each.
Heartbeat. A Leader whose heartbeat_due elapses broadcasts Heartbeat { ballot: my_ballot }. Followers reset their election timers on any inbound RPC from the current leader. This is what makes Multi-Paxos amortise Phase 1: as long as heartbeats arrive, no one starts a new ballot.
Decided broadcast. When a leader's try_decide(slot) sees a quorum of accept_ok=true for the slot's ballot, it marks learned[slot] = value and broadcasts Decided { slot, value } to every node. Learners record the value in learned; the leader does not need to re-decide on receipt.
Lowest-id leader rule. When tests inspect "the" leader of a cluster, they pick the Leader with the lowest id. This is a deterministic tie-break for the (rare) case where two nodes briefly both believe themselves leader during a flap; the safety invariants do not depend on at-most-one Leader, only on at-most-one chosen value per slot.
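
The recovery sweep in become_leader reduces to a per-slot maximum over the promises. The sketch below assumes each promise's `accepted` list is a vector of (slot, ballot, value) tuples; `recovery_sweep` and `next_slot` are hypothetical helper names, and returning 0 when both maps are empty is an assumption the task text does not spell out.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct Ballot {
    pub round: u32,
    pub proposer_id: u32,
}

pub type Slot = u64;
pub type Accepted = (Slot, Ballot, Vec<u8>);

/// For every slot mentioned in any promise, keep the value with the highest ballot.
pub fn recovery_sweep(promises: &[Vec<Accepted>]) -> BTreeMap<Slot, (Ballot, Vec<u8>)> {
    let mut best: BTreeMap<Slot, (Ballot, Vec<u8>)> = BTreeMap::new();
    for promise in promises {
        for (slot, ballot, value) in promise {
            let replace = match best.get(slot) {
                Some((seen, _)) => ballot > seen,
                None => true,
            };
            if replace {
                best.insert(*slot, (*ballot, value.clone()));
            }
        }
    }
    best
}

/// One past the highest recovered slot, else one past the highest learned slot.
/// Assumption: 0 when neither map has entries.
pub fn next_slot(
    recovered: &BTreeMap<Slot, (Ballot, Vec<u8>)>,
    learned: &BTreeMap<Slot, Vec<u8>>,
) -> Slot {
    recovered
        .keys()
        .next_back()
        .or_else(|| learned.keys().next_back())
        .map_or(0, |s| s + 1)
}
```

The new leader would then broadcast Accept under my_ballot for each entry of the returned map, in ascending slot order, before draining pending client values.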

Acceptance

Inline unit tests in each language:

single_node_decides_every_proposal — a 1-node cluster (quorum 1) with proposals = 3 ends with learned = [(0, "val-0"), (1, "val-1"), (2, "val-2")]. Degenerate case but verifies the leader path.
three_node_elects_single_leader — Cluster::new(42, 3) after 500 ticks with zero proposals has exactly one node in role Leader.
three_node_replicates_proposals — Cluster::new(42, 3) after 1000 ticks with proposals = 5 has every node's learned of length 5 and byte-identical to node 0's.
multi_slot_log_is_contiguous — 10 proposals on a 3-node cluster yield slot keys 0..10 on every node, no gaps.
partition_heals_progress_resumes — drop all traffic between node 0 and the other two; the surviving pair {1, 2} still elects a leader and decides 4 proposals. Demonstrates that Multi-Paxos liveness depends on some quorum being connected, not on the original leader being reachable.

All five green in Rust, Go, and C++.

Discussion prompts

Amortisation. Why is the Phase 1 cost paid only at leader change in Multi-Paxos but on every decision in single-decree Paxos? What is the steady-state message count per decision on a 5-node cluster?
Leader leases. Real systems (Spanner, Chubby) layer a lease on top of Multi-Paxos so the leader can serve linearizable reads without quorum. What changes in the safety argument if you serve reads off the leader without a lease?
Recovery cost. A new leader must walk every acceptor's full accepts map for the recovery sweep. What is the message size in bytes for a log with 1M slots and 256-byte values? What optimisations (truncation, snapshots, min_slot exchange) would you add for production?

db-18 step 03 — Cross-language determinism

Goal

The Rust, Go, and C++ implementations must, given the same (seed, nodes, rounds, proposals, partition) CLI inputs, produce the byte-identical canonical dump and therefore the same SHA-256. This is the third leg of the lab: protocol correctness plus simulator determinism plus serialisation discipline.

Tasks

Discrete-event simulator. A Cluster owns a min-heap of pending RPCs keyed (delivery_time, src, seq). seq is a monotonically increasing per-cluster counter assigned at send time, breaking ties when two RPCs from the same sender become deliverable on the same tick. Every send pushes onto the heap; every tick pops everything due, dispatches it via node.handle(rpc, src, t, &mut out), and pushes any reply RPCs back onto the heap with delivery = t + 1 + splitmix64(seed ^ src ^ dst ^ seq) % 3.
Iteration discipline. All iteration over collections of nodes, slots, or peers must be in sorted order. Rust uses BTreeMap / BTreeSet exclusively. Go uses sort.Slice / sort.Ints before every loop over a map's keys. C++ uses std::map / std::set. A single iteration over a hash map anywhere in the protocol path is enough to make the three implementations diverge on runs of ~2000 ticks.
Partition modelling. The Cluster carries a Drop: Set<(u32, u32)> of dropped unidirectional edges. The CLI flag --partition s,d,s,d,... parses pairs and inserts them. Asymmetric partitions are intentional: --partition 0,1 only drops 0→1 traffic, not 1→0. Scenario F exercises this.
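
A sketch of how the --partition value could be turned into the drop set. `parse_partition` and `is_dropped` are hypothetical names; the pair format and the one-directional semantics are taken from the task description, and rejecting an odd number of ids is an assumption.

```rust
use std::collections::BTreeSet;

/// Parse "s,d,s,d,..." into a set of dropped unidirectional edges (src, dst).
pub fn parse_partition(flag: &str) -> Result<BTreeSet<(u32, u32)>, String> {
    let ids: Vec<u32> = flag
        .split(',')
        .map(|tok| {
            tok.trim()
                .parse::<u32>()
                .map_err(|e| format!("bad node id {:?}: {}", tok, e))
        })
        .collect::<Result<_, _>>()?;
    if ids.len() % 2 != 0 {
        return Err(format!("expected src,dst pairs, got {} ids", ids.len()));
    }
    Ok(ids.chunks(2).map(|p| (p[0], p[1])).collect())
}

/// True if a message from `src` to `dst` must be dropped.
/// The reverse direction is checked separately, so partitions can be asymmetric.
pub fn is_dropped(drop: &BTreeSet<(u32, u32)>, src: u32, dst: u32) -> bool {
    drop.contains(&(src, dst))
}
```

With "0,1" only the 0→1 edge is dropped; the Scenario E flag "0,1,0,2,1,0,2,0" yields four edges that isolate node 0 in both directions.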

Canonical dump. canonical_dump(&cluster) writes:

"DSEPAX01"                     (8 bytes magic)
u32_le(node_count)
for each node in ascending id:
    u32_le(id)
    u32_le(promised_ballot.round)
    u32_le(promised_ballot.proposer_id)
    u8(role)                   (Follower=0, Candidate=1, Leader=2)
    u32_le(my_ballot.round)
    u32_le(my_ballot.proposer_id)
    u32_le(accepts_len)
    for each (slot, (ballot, value)) in accepts, ascending slot:
        u64_le(slot)
        u32_le(ballot.round)
        u32_le(ballot.proposer_id)
        u32_le(value_len)
        value bytes
    u32_le(learned_len)
    for each (slot, value) in learned, ascending slot:
        u64_le(slot)
        u32_le(value_len)
        value bytes

Hash the bytes with SHA-256, print lowercase hex, no trailing newline.
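
The layout above maps directly onto a byte writer. The following is a sketch under assumptions rather than the lab's writer: `NodeState` is a hypothetical flattened view of a node, the cluster is passed as a `BTreeMap` keyed by id so ascending order is structural, and the SHA-256 step is left out because each implementation ships its own.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy)]
pub struct Ballot {
    pub round: u32,
    pub proposer_id: u32,
}

#[derive(Clone, Copy)]
pub enum Role {
    Follower = 0,
    Candidate = 1,
    Leader = 2,
}

/// Hypothetical dump-time view of one node.
pub struct NodeState {
    pub id: u32,
    pub promised_ballot: Ballot,
    pub role: Role,
    pub my_ballot: Ballot,
    pub accepts: BTreeMap<u64, (Ballot, Vec<u8>)>,
    pub learned: BTreeMap<u64, Vec<u8>>,
}

fn put_u32(out: &mut Vec<u8>, v: u32) {
    out.extend_from_slice(&v.to_le_bytes());
}

fn put_u64(out: &mut Vec<u8>, v: u64) {
    out.extend_from_slice(&v.to_le_bytes());
}

/// Length-prefixed byte string: u32_le(len) followed by the bytes.
fn put_bytes(out: &mut Vec<u8>, b: &[u8]) {
    put_u32(out, b.len() as u32);
    out.extend_from_slice(b);
}

/// Serialize the cluster; BTreeMap iteration gives ascending id and slot order.
pub fn canonical_dump(nodes: &BTreeMap<u32, NodeState>) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(b"DSEPAX01");
    put_u32(&mut out, nodes.len() as u32);
    for node in nodes.values() {
        put_u32(&mut out, node.id);
        put_u32(&mut out, node.promised_ballot.round);
        put_u32(&mut out, node.promised_ballot.proposer_id);
        out.push(node.role as u8);
        put_u32(&mut out, node.my_ballot.round);
        put_u32(&mut out, node.my_ballot.proposer_id);
        put_u32(&mut out, node.accepts.len() as u32);
        for (slot, (ballot, value)) in &node.accepts {
            put_u64(&mut out, *slot);
            put_u32(&mut out, ballot.round);
            put_u32(&mut out, ballot.proposer_id);
            put_bytes(&mut out, value);
        }
        put_u32(&mut out, node.learned.len() as u32);
        for (slot, value) in &node.learned {
            put_u64(&mut out, *slot);
            put_bytes(&mut out, value);
        }
    }
    out
}
```

Using `to_le_bytes` keeps the output independent of host byte order, and a single node with one learned 5-byte value serializes to 58 bytes.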

CLI: paxosctl. Each language ships a binary that accepts --seed <u64> --nodes <u32> --rounds <u32> --proposals <u32> [--partition s,d,...], runs the cluster for rounds ticks with proposals scheduled at tick = (i+1) * rounds / (proposals+1), value = b"val-" + itoa(i), dumps canonical bytes, prints the hex SHA-256.
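scripts/cross_test.sh. Builds all three binaries, runs the 6 scenarios A–F against each, compares the three hashes with one another, and exits non-zero on mismatch. Checking the matching hash against the canonical table is a separate comparison against docs/observation.md § Expected canonical hashes. The script ends with === ALL OK === on success.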

Acceptance

Inline unit tests in each language:

dump_deterministic_across_runs — two independent Cluster::new(42, 3) instances each run 1000 ticks with 5 proposals produce byte-identical dumps. Confirms intra-language determinism.
Scenario A --seed 42 --nodes 3 --rounds 1000 --proposals 5 → 0a35fdad1dd97c76a40a61b020c6181a56c4a40d4f723cb68fe70c2112aa9b63
Scenario B --seed 7 --nodes 5 --rounds 2000 --proposals 20 → 3cc6cae6cb7f9d2b7cb88088a0f22581ac4c41bd86bab1b3676dd0ba33fd7ead
Scenario C --seed 99 --nodes 3 --rounds 500 --proposals 0 → f28d025af748a790beded6167115c7094a7f939b45d439728e4d6b7e144c3be0
Scenario D --seed 1 --nodes 1 --rounds 200 --proposals 5 → e5e0248c7c4fa20991b90afdac828eab91a7414497461dadc2e1553040693139
Scenario E --seed 42 --nodes 3 --rounds 1000 --proposals 3 --partition 0,1,0,2,1,0,2,0 → 674e62d809248ac99401054c195d29b0e2eed6ccc78ec45e96da8aaf69c36096
Scenario F --seed 3 --nodes 5 --rounds 1500 --proposals 10 --partition 0,1 → 7d80176abad54e533b2f4174e84f58432a000255fbb2ecbbb1dd915cb6bb6ab5

All six match across Rust, Go, and C++; bash scripts/cross_test.sh exits 0 with === ALL OK ===.

Discussion prompts

Sort discipline. Find the language-default hash map in your language. What is its iteration order? What is the cost of replacing it with the language's ordered map for the canonical dump path only versus everywhere?
SplitMix64. Why is splitmix64 a good fit for a deterministic simulator clock when something like rand::thread_rng() is not? Walk through the three constants — what are they and why?
Three languages. What classes of bug does the cross-language test catch that a single-language test cannot? (Hint: think signed-vs-unsigned overflow, default hash randomisation, iteration order, integer-promotion rules in comparisons.)

db-19 — ZAB (ZooKeeper Atomic Broadcast)

This lab implements ZAB — the atomic broadcast protocol that drives Apache ZooKeeper — in Rust, Go, and C++, all three producing a byte-identical sha256 of a canonical cluster dump for any (seed, nodes, rounds, proposals, partition) configuration. It inherits the deterministic-simulator discipline of db-16 and db-17: same splitmix64 seeding, same (delivery_time, sender, seq) heap tie-break, same "sorted iteration on the wire" rule.

Where db-17 Raft taught you that one consensus algorithm can be pinned down to a single byte sequence across three languages, db-19 ZAB does the same exercise for a different algorithm with a meaningfully different recovery story: an explicit Discovery / Synchronization phase between leader election and steady-state broadcast, and a transaction identifier (zxid) that pairs an epoch with a counter rather than Raft's single monotonic term + index.

What is it?

ZAB (Reed & Junqueira, LADIS 2008; Junqueira, Reed & Serafini, DSN 2011) is the primary-backup atomic broadcast protocol that ZooKeeper uses to keep its replicated state machine consistent. It is not a generic consensus library; it was designed specifically for ZooKeeper's workload: a small, well-known cluster (3, 5, 7 nodes), a small in-memory state machine, and a strong primary-order guarantee that arbitrary client requests served by the same primary are delivered in the order the primary chose.

ZAB decomposes into four phases. Phase 0 is the original FastLeaderElection; later papers fold it into Phase 1.

Leader election (FastLeaderElection). Every node starts in Looking. Each node broadcasts its current vote — initially for itself — carrying (last_zxid, server_id). On receiving a peer vote, a Looking node updates its own vote to that peer if (peer.last_zxid, peer.id) > (own.last_zxid, own.id) lexicographically, then re-broadcasts. When a quorum of voters agree on the same target, that node is elected: it transitions to Leading, everyone else who voted for it transitions to Following.
Discovery. The new prospective leader picks a fresh new_epoch = max(accepted_epoch, current_epoch) + 1, sets its own accepted_epoch = new_epoch, and broadcasts NewEpoch(new_epoch). Each follower that accepts updates its accepted_epoch and replies with AckEpoch(current_epoch, last_zxid). Once a quorum of AckEpochs arrives, the leader knows the highest (current_epoch, last_zxid) in the surviving quorum — that node's history is the one that must survive.
Synchronization. The leader bumps its current_epoch = new_epoch, resets the per-epoch counter, and broadcasts NewLeader(new_epoch, history) — the whole history that this epoch will start from. Followers replace their local history with the leader's, set current_epoch = new_epoch, and reply AckLeader(new_epoch). On a quorum of AckLeaders the leader declares itself synced and broadcasts Commit(last_zxid_of_history) so followers can advance last_committed past the synced tail.
Broadcast (steady state). Now indistinguishable from Raft's replication phase, modulo names. For each client proposal, the leader assigns zxid = (current_epoch, ++counter), appends to its history, broadcasts Propose(txn). Followers append in zxid order and reply Ack(zxid). On Ack quorum the leader broadcasts Commit(zxid). Heartbeats are implemented as periodic re-sends of the last Commit (or NewLeader during pre-sync) — receiving one from the current leader refreshes the follower's election timer.

The simulator drives sim time forward in integer ticks; messages are scheduled into a heap with deterministic (delivery_time, sender, seq) order; an optional partition set drops messages in named directions.
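
The heap ordering is the part most likely to drift between languages, so it is worth seeing in isolation. This is an illustrative sketch only: `EventQueue`, `send`, and `pop_due` are hypothetical names, and the message type is left generic. Only the (delivery_time, sender, seq) key and the per-cluster seq counter come from the text.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// (delivery_time, sender, seq); tuples compare lexicographically.
type Key = (u64, u32, u64);

struct Event<M> {
    key: Key,
    dst: u32,
    msg: M,
}

// Ordering looks at the key only, and is reversed so that Rust's max-heap
// pops the smallest key first.
impl<M> PartialEq for Event<M> {
    fn eq(&self, other: &Self) -> bool {
        self.key == other.key
    }
}
impl<M> Eq for Event<M> {}
impl<M> PartialOrd for Event<M> {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl<M> Ord for Event<M> {
    fn cmp(&self, other: &Self) -> Ordering {
        other.key.cmp(&self.key)
    }
}

pub struct EventQueue<M> {
    heap: BinaryHeap<Event<M>>,
    next_seq: u64,
}

impl<M> EventQueue<M> {
    pub fn new() -> Self {
        EventQueue { heap: BinaryHeap::new(), next_seq: 0 }
    }

    /// Schedule a message; seq is assigned at send time and never reused.
    pub fn send(&mut self, delivery_time: u64, src: u32, dst: u32, msg: M) {
        let seq = self.next_seq;
        self.next_seq += 1;
        self.heap.push(Event { key: (delivery_time, src, seq), dst, msg });
    }

    /// Pop every event due at or before tick `t`, as (src, dst, msg) in key order.
    pub fn pop_due(&mut self, t: u64) -> Vec<(u32, u32, M)> {
        let mut due = Vec::new();
        while self.heap.peek().map_or(false, |e| e.key.0 <= t) {
            let e = self.heap.pop().unwrap();
            due.push((e.key.1, e.dst, e.msg));
        }
        due
    }
}
```

Because seq is unique, no two keys ever compare equal, so the pop order does not depend on how a particular heap implementation breaks ties.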

Why does it matter?

ZAB is the algorithm under ZooKeeper, and ZooKeeper is the coordination kernel under Kafka (pre-KRaft), HBase, Hadoop YARN, Mesos, Druid, and a long list of other production systems. Knowing exactly how the NewLeader / Sync handshake works is the difference between operating ZooKeeper and understanding it.
ZAB and Raft cover the same problem with meaningfully different shapes. ZAB has an explicit recovery handshake that Raft folds into the AppendEntries consistency check; ZAB's zxid = (epoch, counter) is essentially Raft's (term, index), but the role each plays is subtly different. Implementing both back-to-back makes the contrast concrete instead of conceptual.
Three byte-identical implementations force the spec to be unambiguous. Anywhere ZAB "depends on the implementation" — election tie-break, vote rebroadcast on update, AckEpoch idempotency, heap scheduling — has to be pinned down. The cross-language sha256 makes drift loud.
Reproducible partitions. With a deterministic --partition s,d,... flag and a seeded simulator, you can replay the exact sequence of message drops, leader churn, and committed transactions that triggered a bug, on any machine, in any of the three languages.
Foundation for the rest of the track. db-20 distributed-kv will plug a consensus engine into a real key-value store; db-23 capstone composes the simulator harness across multiple replicated shards.

How does it work?

State (per node)

persistent  : current_epoch   : u32     # epoch of the leader we accepted into sync
              accepted_epoch  : u32     # epoch we've ack'd via NewEpoch (>= current_epoch)
              history         : Vec<Txn>
              last_committed  : ZxId

volatile    : role            : Looking | Following | Leading
              leader_id       : Option<u32>

election    : vote_target_id  : u32              # who we currently vote for
              vote_target_zxid: ZxId             # the (last_zxid) we voted on
              vote_view       : Map<voter_id, leader_id>   # tally

leader-only : pending_new_epoch : u32
              epoch_acks        : Set<follower_id>   # AckEpoch quorum tracker
              leader_acks       : Set<follower_id>   # AckLeader quorum tracker
              synced            : bool
              next_counter      : u32                # zxid counter under current_epoch
              ack_set           : Map<ZxId, Set<follower_id>>

timers      : election_deadline   : u64                # sim-time tick
              last_heartbeat_sent : u64

Election timer

reset_election_deadline(t):
    election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150

A 150-tick base plus 150 ticks of seeded jitter avoids split-vote loops. Heartbeats fire every 50 ticks once a leader is synced.
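
A sketch of the timer formula in Rust. The mixing function uses the standard SplitMix64 constants applied as a stateless hash of its argument; whether the lab's splitmix64 is stateless in the same way is an assumption, so treat the exact outputs as illustrative.

```rust
/// One SplitMix64 step, used here as a stateless hash of `x`.
pub fn splitmix64(x: u64) -> u64 {
    let mut z = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150
pub fn election_deadline(seed: u64, node_id: u32, t: u64) -> u64 {
    t + 150 + splitmix64(seed ^ u64::from(node_id) ^ t) % 150
}
```

Every deadline lands in the half-open window [t + 150, t + 300), and the same (seed, node_id, t) always yields the same deadline, which is what makes a run replayable.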

FastLeaderElection (Phase 0)

on entering Looking:
    vote_target_id   = self.id
    vote_target_zxid = self.last_zxid()
    vote_view.clear(); vote_view[self.id] = self.id
    broadcast LookForLeader { self.id, self.last_zxid, current_epoch }
    broadcast Vote          { self.id, self.last_zxid, current_epoch, leader=self.id }
    check_election()

on Vote(voter, peer_zxid, _, leader_chosen) while Looking:
    if (peer_zxid, voter) > (vote_target_zxid, vote_target_id):
        vote_target_id   = voter
        vote_target_zxid = peer_zxid
        vote_view.clear(); vote_view[self.id] = voter
        broadcast Vote { self.id, self.last_zxid, current_epoch, leader=voter }
    vote_view[voter] = leader_chosen
    check_election()

check_election():
    target = vote_target_id
    if count(v in vote_view.values() : v == target) >= quorum:
        if target == self.id: become_leading()
        else:                 become_following(target)

LookForLeader is structurally a Vote for the sender: it lets a late-arriving node bootstrap a tally without waiting for the next broadcast cycle. Non-Looking peers reply to a Vote with their own current vote (which points at the live leader), so isolated nodes converge fast on heal.
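
The Vote handler and check_election translate almost line for line. The sketch below is a minimal single-node view under assumptions: `Elector` and `Outcome` are hypothetical types, broadcasting is reduced to a boolean "should re-broadcast" flag, and the epoch field of the Vote is omitted because the pseudocode ignores it.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct ZxId {
    pub epoch: u32,
    pub counter: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    StillLooking,
    Leading,
    Following(u32),
}

/// Hypothetical election-only slice of a node in the Looking role.
pub struct Elector {
    pub id: u32,
    pub quorum: usize,
    pub vote_target_id: u32,
    pub vote_target_zxid: ZxId,
    pub vote_view: BTreeMap<u32, u32>, // voter_id -> leader_id
}

impl Elector {
    /// Enter Looking: vote for ourselves.
    pub fn new(id: u32, n: usize, last_zxid: ZxId) -> Self {
        let mut vote_view = BTreeMap::new();
        vote_view.insert(id, id);
        Elector { id, quorum: n / 2 + 1, vote_target_id: id, vote_target_zxid: last_zxid, vote_view }
    }

    /// Handle Vote(voter, peer_zxid, _, leader_chosen).
    /// Returns (re-broadcast our vote?, election outcome).
    pub fn on_vote(&mut self, voter: u32, peer_zxid: ZxId, leader_chosen: u32) -> (bool, Outcome) {
        let mut rebroadcast = false;
        // Lexicographic comparison: zxid first, then id as the tie-break.
        if (peer_zxid, voter) > (self.vote_target_zxid, self.vote_target_id) {
            self.vote_target_id = voter;
            self.vote_target_zxid = peer_zxid;
            self.vote_view.clear();
            self.vote_view.insert(self.id, voter);
            rebroadcast = true;
        }
        self.vote_view.insert(voter, leader_chosen);
        (rebroadcast, self.check_election())
    }

    pub fn check_election(&self) -> Outcome {
        let target = self.vote_target_id;
        let votes = self.vote_view.values().filter(|v| **v == target).count();
        if votes < self.quorum {
            Outcome::StillLooking
        } else if target == self.id {
            Outcome::Leading
        } else {
            Outcome::Following(target)
        }
    }
}
```

In a fresh 3-node cluster where every zxid is (0, 0), node 0 switches to node 2 on its first vote and follows it, while node 2 only becomes Leading once node 0's updated vote arrives.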

Discovery & Synchronization (Phases 1–2)

become_leading():
    role = Leading
    pending_new_epoch = max(accepted_epoch, current_epoch) + 1
    accepted_epoch    = pending_new_epoch
    epoch_acks = {self.id}
    broadcast NewEpoch(pending_new_epoch)
    try_finish_discovery()      # handles the n=1 case immediately

on NewEpoch(e) from L:
    if e > accepted_epoch:
        accepted_epoch = e
        if role != Following: become_following(L)
        reply AckEpoch(current_epoch, last_zxid)
    elif e == accepted_epoch:
        reply AckEpoch(current_epoch, last_zxid)   # idempotent
    reset_election_deadline()

on AckEpoch from F (only if Leading):
    epoch_acks += F
    try_finish_discovery()

try_finish_discovery():
    if |epoch_acks| < quorum: return
    current_epoch = pending_new_epoch
    next_counter  = 0
    leader_acks   = {self.id}
    broadcast NewLeader(current_epoch, history.clone())
    try_finish_sync()

on NewLeader(e, hist) from L:
    if e >= accepted_epoch:
        accepted_epoch = e
        current_epoch  = e
        history        = hist          # follower truncates / extends to leader's history
        if role != Following: become_following(L)
        reset_election_deadline()
        reply AckLeader(e)

on AckLeader(e) from F (only if Leading and e == current_epoch):
    leader_acks += F
    try_finish_sync()

try_finish_sync():
    if synced or |leader_acks| < quorum: return
    synced = true
    if last_zxid() > last_committed:
        last_committed = last_zxid()
        broadcast Commit(last_committed)

Broadcast (Phase 3)

propose(payload):
    require role == Leading and synced
    next_counter += 1
    zxid = (current_epoch, next_counter)
    history.push(Txn { zxid, payload })
    ack_set[zxid] = {self.id}
    broadcast Propose(Txn { zxid, payload })
    try_commit(zxid)                   # single-node case

on Propose(txn) from L (only if Following and L == leader_id):
    if txn.zxid > last_zxid():
        history.push(txn)
        reset_election_deadline()
        reply Ack(txn.zxid)

on Ack(zxid) from F (only if Leading):
    ack_set[zxid] += F
    try_commit(zxid)

try_commit(zxid):
    if zxid <= last_committed: return
    if |ack_set[zxid]| >= quorum:
        last_committed = zxid
        broadcast Commit(zxid)

on Commit(zxid) from L:
    if L is current leader:
        reset_election_deadline()
    if last_committed < zxid <= last_zxid():
        last_committed = zxid

Simulator loop (per tick `t in 0..rounds`)

1. enqueue scheduled proposals : if t == schedule[i], push payload onto pending
2. inject pending into leader  : pick (Leading and synced, lowest id); call propose
3. deliver in-flight           : pop heap entries with delivery_time <= t
4. tick all nodes              : iterate in ascending id; on_tick may fire election or heartbeat

Proposal schedule: schedule[i] = (i+1) * rounds / (K+1) for i in 0..K (integer division). Each payload is the byte string "zab-<i>" (plain decimal, no padding). Deterministic, evenly spread, and independent of cluster behaviour.
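The closed-form schedule above is small enough to write out directly. The following is a minimal sketch, not the lab's code: the function name proposal_schedule and the tuple return shape are assumptions made for illustration; only the formula and the payload format come from the text.

```rust
/// Hypothetical helper: returns (tick, payload) for each of the K proposals.
/// tick = (i + 1) * rounds / (K + 1) with integer division, payload = "zab-<i>".
fn proposal_schedule(rounds: u64, k: u64) -> Vec<(u64, Vec<u8>)> {
    (0..k)
        .map(|i| {
            let tick = (i + 1) * rounds / (k + 1); // integer division, truncates
            let payload = format!("zab-{}", i).into_bytes(); // unpadded decimal
            (tick, payload)
        })
        .collect()
}

fn main() {
    // Scenario-A-like parameters: 1000 rounds, 5 proposals.
    for (tick, payload) in proposal_schedule(1000, 5) {
        println!("t={} payload={}", tick, String::from_utf8_lossy(&payload));
    }
}
```

Because the schedule depends only on rounds and K, it can be computed before the simulation starts, which is what keeps it independent of cluster behaviour.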

Wire format (Rpc)

Nine variants. The simulator never serializes RPCs — it passes them as typed values in memory. The only bytes that ever get hashed are the canonical dump.

LookForLeader { src_id, last_zxid, peer_epoch }
Vote          { voter_id, last_zxid, peer_epoch, leader_id }
NewEpoch      { new_epoch }
AckEpoch      { current_epoch, last_zxid }
NewLeader     { new_epoch, history: Vec<Txn> }
AckLeader     { new_epoch }
Propose       { txn: Txn }
Ack           { zxid }
Commit        { zxid }

Canonical dump format

file := magic[8 = "DSEZAB01"] u32_le(node_count) node*

node := u32_le id
        u8     role               # Looking=0, Following=1, Leading=2
        u32_le current_epoch
        u32_le accepted_epoch
        u32_le last_zxid.epoch
        u32_le last_zxid.counter
        u32_le last_committed.epoch
        u32_le last_committed.counter
        u32_le history_len
        txn * history_len

txn  := u32_le zxid.epoch
        u32_le zxid.counter
        u32_le payload_len
        u8 payload[payload_len]

Nodes appear in ascending id order. All multi-byte numbers are little-endian. The dump is hashed with SHA-256; the lowercase hex digest is what zabctl prints (no trailing newline).
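To make the grammar above concrete, here is a minimal sketch of a dump writer. The struct and function names (NodeState, canonical_dump_sketch, sample_node) are hypothetical and chosen for this example; the real crate's types may differ. SHA-256 hashing is left out because it is not in the Rust standard library.

```rust
#[derive(Clone, Copy)]
struct Zxid { epoch: u32, counter: u32 }

struct Txn { zxid: Zxid, payload: Vec<u8> }

struct NodeState {
    id: u32,
    role: u8, // Looking=0, Following=1, Leading=2
    current_epoch: u32,
    accepted_epoch: u32,
    last_zxid: Zxid,
    last_committed: Zxid,
    history: Vec<Txn>,
}

fn put_u32(out: &mut Vec<u8>, v: u32) {
    out.extend_from_slice(&v.to_le_bytes()); // always little-endian
}

fn put_zxid(out: &mut Vec<u8>, z: Zxid) {
    put_u32(out, z.epoch);
    put_u32(out, z.counter);
}

/// Writes magic, node_count, then every node in ascending id order.
fn canonical_dump_sketch(nodes: &[NodeState]) -> Vec<u8> {
    let mut sorted: Vec<&NodeState> = nodes.iter().collect();
    sorted.sort_by_key(|n| n.id);
    let mut out = Vec::new();
    out.extend_from_slice(b"DSEZAB01");
    put_u32(&mut out, sorted.len() as u32);
    for n in sorted {
        put_u32(&mut out, n.id);
        out.push(n.role);
        put_u32(&mut out, n.current_epoch);
        put_u32(&mut out, n.accepted_epoch);
        put_zxid(&mut out, n.last_zxid);
        put_zxid(&mut out, n.last_committed);
        put_u32(&mut out, n.history.len() as u32);
        for t in &n.history {
            put_zxid(&mut out, t.zxid);
            put_u32(&mut out, t.payload.len() as u32);
            out.extend_from_slice(&t.payload);
        }
    }
    out
}

/// Hypothetical fixture: a node in epoch 1 holding `n` committed "zab-<i>" txns.
fn sample_node(id: u32, role: u8, n: u32) -> NodeState {
    let history: Vec<Txn> = (0..n)
        .map(|i| Txn {
            zxid: Zxid { epoch: 1, counter: i + 1 },
            payload: format!("zab-{}", i).into_bytes(),
        })
        .collect();
    let last = history.last().map_or(Zxid { epoch: 0, counter: 0 }, |t| t.zxid);
    NodeState { id, role, current_epoch: 1, accepted_epoch: 1, last_zxid: last, last_committed: last, history }
}

fn main() {
    let dump = canonical_dump_sketch(&[sample_node(0, 2, 5)]);
    println!("{} bytes, magic {:?}", dump.len(), String::from_utf8_lossy(&dump[..8]));
}
```

Sorting inside the writer, rather than trusting the caller's order, is one way to keep the ascending-id rule from depending on how the cluster stores its nodes.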

Primary-order property

ZAB's defining guarantee, distinct from generic atomic broadcast, is primary order: if a primary (leader) broadcasts proposals p then q in that order, every follower that delivers both delivers p before q. This is enforced trivially by the leader's monotonically increasing next_counter and the follower's txn.zxid > last_zxid() gate on Propose. Primary order is a per-primary property; across leadership changes the guarantee is provided by the Discovery / Sync handshake that explicitly chooses the surviving primary's history.
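The follower-side gate can be illustrated in a few lines. This is a sketch under assumptions: the Follower struct and the tuple representation of a zxid are invented for the example, and only the txn.zxid > last_zxid() comparison is taken from the text.

```rust
type Zxid = (u32, u32); // (epoch, counter); tuples compare lexicographically

struct Follower {
    history: Vec<(Zxid, Vec<u8>)>,
}

impl Follower {
    fn last_zxid(&self) -> Zxid {
        self.history.last().map_or((0, 0), |txn| txn.0)
    }

    /// Returns true if the proposal was appended (and would be acked).
    /// A proposal at or below last_zxid() is dropped, never re-ordered in.
    fn on_propose(&mut self, zxid: Zxid, payload: &[u8]) -> bool {
        if zxid > self.last_zxid() {
            self.history.push((zxid, payload.to_vec()));
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut f = Follower { history: Vec::new() };
    // (1, 2) arrives after (1, 3): it is rejected instead of being inserted.
    for zxid in [(1, 1), (1, 3), (1, 2)] {
        let kept = f.on_propose(zxid, b"x");
        println!("{:?} appended={}", zxid, kept);
    }
}
```

The gate keeps each follower's history strictly increasing in zxid, which is the local half of primary order; it does not by itself fill the gap left by a dropped proposal.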

Cross-language invariants

Invariant	Why it matters
splitmix64 constants `0x9E3779B97F4A7C15`, `0xBF58476D1CE4E7B5`, `0x94D049BB133111EB`	identical PRNG output
`election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150`	identical election firing times
`delivery_delay = 1 + splitmix64(seed ^ src ^ dst ^ t) % 3`	identical message scheduling
heap order `(delivery_time, sender, seq)`; `seq` global monotonic	identical delivery sequence
peers iterated in ascending id (`BTreeMap` / `std::map` / explicit loop)	identical broadcast order
`vote_view` keyed by voter id, iterated in ascending id	identical election tally
election tie-break: lexicographic `(last_zxid, voter_id)`	identical leader choice
leader-pick for proposal injection: `Leading && synced && min id`	identical client routing
proposal schedule `(i+1)*rounds/(K+1)`; payload `"zab-<i>"` unpadded decimal	identical pending queue contents
`propose()` calls `try_commit()`	identical `last_committed` for n=1
`Role` enum order `Looking=0, Following=1, Leading=2`	identical dump bytes
dump magic `"DSEZAB01"`; all integers u32 LE; nodes in ascending id	identical dump bytes

If any one of these drifts, scripts/cross_test.sh will fail and cmp -l on the two raw dumps will print the byte offset of the first divergence.
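The timer and delay rows of the table can be sketched as below. The three constants and the two formulas come from the table; the stateless one-shot form of splitmix64 (add the increment, then mix) is an assumption about how the lab applies it, so treat this as an illustration rather than the reference implementation.

```rust
/// One-shot splitmix64 mix of a 64-bit input (assumed form, see lead-in).
fn splitmix64(x: u64) -> u64 {
    let mut z = x.wrapping_add(0x9E3779B97F4A7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E7B5);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
    z ^ (z >> 31)
}

/// election_deadline = t + 150 + splitmix64(seed ^ node_id ^ t) % 150
fn election_deadline(seed: u64, node_id: u32, t: u64) -> u64 {
    t + 150 + splitmix64(seed ^ (node_id as u64) ^ t) % 150
}

/// delivery_delay = 1 + splitmix64(seed ^ src ^ dst ^ t) % 3
fn delivery_delay(seed: u64, src: u32, dst: u32, t: u64) -> u64 {
    1 + splitmix64(seed ^ (src as u64) ^ (dst as u64) ^ t) % 3
}

fn main() {
    let seed = 42;
    for node_id in 0..3u32 {
        println!("node {} first deadline: {}", node_id, election_deadline(seed, node_id, 0));
    }
    println!("delay 0->1 at t=10: {}", delivery_delay(seed, 0, 1, 10));
}
```

All arithmetic is wrapping 64-bit unsigned, which is the detail most likely to drift between languages (Go and C++ wrap silently on unsigned overflow; Rust needs the explicit wrapping_* calls to avoid a debug-build panic).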

Files

src/rust/ — zab19 crate + zabctl binary.
src/go/ — module github.com/10xdev/dse/db19 + cmd/zabctl.
src/cpp/ — db19_lib static library + zabctl binary + test_db19.
scripts/verify.sh — runs the unit tests for all three.
scripts/cross_test.sh — proves the three binaries produce byte-identical canonical dumps for six seeded scenarios.

See docs/ for the long-form write-up and steps/ for the staged implementation path.

db-19 — References

Primary sources

Benjamin Reed and Flavio P. Junqueira, A simple totally ordered broadcast protocol, LADIS 2008. The original ZAB paper — short, workshop-length, and the only place that describes the algorithm in the exact "Phase 0 / 1 / 2 / 3" shape it took inside ZooKeeper. https://dl.acm.org/doi/10.1145/1529974.1529978
Flavio P. Junqueira, Benjamin C. Reed, and Marco Serafini, Zab: High-performance broadcast for primary-backup systems, DSN 2011. The peer-reviewed, formal treatment. Defines the primary order property, gives the proof obligations, and folds the original Phase 0 into Phase 1. This is the paper to cite when arguing the correctness of any particular handshake decision. https://marcoserafini.github.io/papers/zab.pdf
Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed, ZooKeeper: Wait-free coordination for Internet-scale systems, USENIX ATC 2010. Describes the system (znodes, sessions, watches, the wait-free API) that ZAB exists to support. Useful for understanding why ZAB was designed with primary order rather than as a generic consensus library. https://www.usenix.org/legacy/event/atc10/tech/full_papers/Hunt.pdf

Implementations to read alongside

Apache ZooKeeper (Java) — the canonical implementation. The classes worth reading first are FastLeaderElection, Leader, Follower, Learner, LeaderZooKeeperServer, and FollowerRequestProcessor. The election logic lives in FastLeaderElection.lookForLeader(); the discovery/sync handshake in Leader.lead() and Follower.followLeader(). https://github.com/apache/zookeeper
Kafka KRaft (KIP-500) — Kafka's replacement for ZooKeeper-based metadata. KRaft is Raft, not ZAB; reading the KIP is useful for understanding why ZooKeeper's biggest user finally moved off of it (operational complexity, not algorithm correctness). https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A+Replace+ZooKeeper+with+a+Self-Managed+Metadata+Quorum

Determinism and simulation

db-16's references on FoundationDB simulation testing and TigerBeetle apply verbatim here. The (delivery_time, sender, seq) heap and the splitmix64-seeded jitter are the same discipline.
The ZooKeeper test suite (zookeeper/src/java/test/.../quorum/) uses scripted scenarios but is not deterministic in the cross-language sense this lab aims for. Worth reading as an example of how the production team tests the algorithm.

Background reading worth doing

Heidi Howard, Distributed consensus revised, Cambridge PhD dissertation, 2019; the 2020 survey A Generalised Solution to Distributed Consensus unifies Paxos, Raft, and ZAB under a single quorum-intersection framework. Helps see ZAB as one point in a design space rather than as an oddball. https://www.cl.cam.ac.uk/~hh360/
Leslie Lamport, Paxos Made Simple, 2001. The contrast with ZAB is illuminating: Paxos picks a value per slot; ZAB streams a totally ordered log under a primary. https://lamport.azurewebsites.net/pubs/paxos-simple.pdf
Diego Ongaro and John Ousterhout, In Search of an Understandable Consensus Algorithm, USENIX ATC 2014 — the Raft paper. Read this before the ZAB papers if you have not already; the comparison in db-17's CONCEPTS.md is the recommended on-ramp. https://raft.github.io/raft.pdf
André Medeiros, ZooKeeper's Atomic Broadcast Protocol: Theory and Practice, Aalto University seminar notes, 2012. A 14-page treatment of ZAB-vs-implementation gotchas; useful when the papers feel terse. https://www.tcs.hut.fi/Studies/T-79.5001/reports/2012-deSouzaMedeiros.pdf

Cross-lab dependencies

Upstream:
- db-16 — distributed-fundamentals: Lamport/VC and the deterministic simulator harness whose discipline this lab inherits wholesale.
- db-17 — Raft: same simulator skeleton; reading Raft first makes ZAB's discovery/sync handshake feel like the explicit version of Raft's implicit AppendEntries consistency check.
- db-18 — Paxos: the other consensus reference point; ZAB's (epoch, counter) is the streaming-log analog of Paxos's (ballot, slot) numbering.
Downstream:
- db-20 — Distributed KV: wraps a consensus engine (could be Raft, ZAB, or Paxos from this track) around a key-value state machine.
- db-21 — Storage-engine-advanced: snapshots and log compaction on top of the canonical history laid down here.
- db-23 — Capstone: composes the simulator harness across multiple replicated shards.

db-19 — Analysis

Required invariants

Election agreement. At most one node finishes a successful election cycle with role == Leading && synced per epoch. Enforced by majority voting in vote_view plus the strictly increasing pending_new_epoch = max(accepted_epoch, current_epoch) + 1 rule: any competing prospective leader sees a higher accepted_epoch and steps down (via NewEpoch rejection) before it can sync.
Primary order. If a single primary broadcasts proposals p then q, every follower that delivers both delivers p before q. Enforced by the leader's monotonically increasing next_counter (no gaps, no reuse within an epoch) plus the follower's txn.zxid > last_zxid() gate on Propose (out-of-order proposals are silently dropped rather than re-ordered into the log).
Integrity. The leader only proposes once it is synced, and followers only append once current_epoch has been adopted via NewLeader. Followers will not append a Propose whose zxid <= last_zxid(), so a stale leader's late Propose for an already-superseded epoch cannot corrupt a follower's history.
Agreement on committed prefix. If a follower has last_committed = z, every other follower's history contains every txn with zxid <= z (because Commit(z) is only broadcast after a quorum has appended every txn up through z, and a future leader must include any quorum's committed prefix: via the Discovery AckEpoch(last_zxid) reports, it adopts the surviving history).
Total order. All followers deliver committed transactions in the same order (the leader-assigned zxid order). This follows directly from primary order + agreement on committed prefix.
Byte determinism. For every (seed, nodes, rounds, proposals, partition) tuple, the three binaries produce identical canonical_dump bytes — hence identical sha256 hex on stdout. scripts/cross_test.sh checks six scenarios.

Design decisions

propose() calls try_commit() at the end. Same single-node argument as db-17 Raft: a one-node cluster is its own majority, no Ack will ever arrive to drive the commit, so the leader must run the quorum check inline. Harmless for n > 1 because the majority check rejects until acks actually arrive.
Sorted iteration on every wire-affecting loop. Rust uses BTreeMap / BTreeSet; C++ uses std::map / std::set; Go uses explicit for p := uint32(0); p < n; p++ loops. The Go code also sorts before iterating wherever a map[uint32]... is read for output (the canonical dump and broadcast loops). HashMap would compile and pass single-language tests but fail cross_test.sh immediately.
LookForLeader is structurally a Vote for the sender. The Rust handle arm folds LookForLeader directly into the Vote arm. This avoids a separate LookForLeaderReply and gives a late-arriving Looking node the ability to tally an immediate self-vote from the source. The Go and C++ implementations do the same fold.
Non-Looking nodes reply to a Vote with their own current vote. An isolated Looking node sending a Vote to peers who are already Following gets back a Vote pointing at the live leader; combined with the lex-update rule, the isolated node converges on the existing leader in O(1) round-trips after partition heal, rather than starting a new election.
Vote lex comparison is (last_zxid, voter_id), not (last_zxid, leader_id). The voter's id is the tie-breaker when histories are equal — this is what makes the highest-id node win a cold-start election. Using leader_id instead would create a fixed-point where any node can vote for any leader and the tally never makes progress.
pending_new_epoch = max(accepted_epoch, current_epoch) + 1. The max covers the case where this node has previously acknowledged a NewEpoch for a leader that then failed before reaching sync. Without the max, the new leader could pick an epoch that some follower has already rejected, leaving sync stalled forever.
AckEpoch is idempotent. A follower that has already adopted accepted_epoch = e replies again on a re-sent NewEpoch(e). This keeps the discovery handshake robust against the heartbeat-driven re-send loop in on_tick while the leader is still gathering acks.
NewLeader ships the whole history, not a diff. Following the ZAB paper. For a study lab this is fine; production ZooKeeper uses SNAP / DIFF / TRUNC variants to avoid bulk transfer. Replacing this with a diff would be a substantial change to the RPC layer and is out of scope.
Heartbeats re-broadcast the last Commit. Once synced, the leader re-broadcasts Commit(last_committed) every 50 ticks. This doubles as the "leader is alive" signal that resets the follower's election timer. Sending a dedicated Heartbeat RPC would be one more wire variant for no behavioural gain.
Proposal schedule is closed-form. schedule[i] = (i+1) * rounds / (K+1) (integer division). Same rationale as db-17: decoupling proposal timing from cluster scheduling decisions keeps the dump bytes from depending on incidental tick alignment.
Library + thin CLI. The lab exposes Cluster::new, run, canonical_dump, and sha256 as a library. The CLI is a few dozen lines of arg parsing plus four function calls.
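The first decision in this list (propose() ends with try_commit()) is easiest to see in code. The sketch below models only the leader's ack bookkeeping; the Leader struct, its field names, and the tuple zxid are assumptions for illustration, and message sending is omitted.

```rust
use std::collections::{BTreeMap, BTreeSet};

type Zxid = (u32, u32); // (epoch, counter)

struct Leader {
    id: u32,
    n: usize, // cluster size
    epoch: u32,
    next_counter: u32,
    last_committed: Zxid,
    ack_set: BTreeMap<Zxid, BTreeSet<u32>>,
}

impl Leader {
    fn new(id: u32, n: usize, epoch: u32) -> Leader {
        Leader { id, n, epoch, next_counter: 0, last_committed: (0, 0), ack_set: BTreeMap::new() }
    }

    fn quorum(&self) -> usize {
        self.n / 2 + 1
    }

    fn propose(&mut self) -> Zxid {
        self.next_counter += 1;
        let zxid = (self.epoch, self.next_counter);
        // The leader counts itself as the first ack.
        self.ack_set.entry(zxid).or_default().insert(self.id);
        // Inline quorum check: with n == 1 no Ack will ever arrive to trigger it.
        self.try_commit(zxid);
        zxid
    }

    fn on_ack(&mut self, from: u32, zxid: Zxid) {
        match self.ack_set.get_mut(&zxid) {
            Some(set) => { set.insert(from); }
            None => return, // ack for a zxid we never proposed: ignore
        }
        self.try_commit(zxid);
    }

    fn try_commit(&mut self, zxid: Zxid) {
        if zxid <= self.last_committed {
            return;
        }
        let acks = self.ack_set.get(&zxid).map_or(0, |s| s.len());
        if acks >= self.quorum() {
            self.last_committed = zxid;
        }
    }
}

fn main() {
    let mut solo = Leader::new(0, 1, 1);
    solo.propose();
    println!("n=1 committed: {:?}", solo.last_committed);

    let mut trio = Leader::new(0, 3, 1);
    let z = trio.propose();
    println!("n=3 before ack: {:?}", trio.last_committed);
    trio.on_ack(1, z);
    println!("n=3 after one ack: {:?}", trio.last_committed);
}
```

For n = 3 the inline check is a no-op until the first follower ack arrives, which matches the "harmless for n > 1" claim.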

Tradeoffs worth flagging

No snapshots, no SNAP/DIFF/TRUNC. The leader sends the full history on every NewLeader. For the bounded rounds of this lab the cost is trivial; for production ZooKeeper it would be prohibitive on large datasets. Snapshots are deferred to db-21.
No client sessions, no znodes, no watches. ZAB exists to serve ZooKeeper, but this lab implements ZAB in isolation. The "state machine" is the history vector itself. Anything ZooKeeper-API-shaped (sessions, ephemerals, watches, ACLs) is downstream of the consensus core and lives in a different lab.
Crash semantics are stylized. Crashes are simulated only via the partition flag (drop all messages in one direction). A real ZooKeeper must handle persistent storage corruption, fsync ordering, and restart-mid-vote; the canonical dump pretends all state is durable by construction.
No Observer role. Production ZooKeeper has non-voting Observer servers that learn from the leader but do not participate in quorum. They are pure read-fanout and add no algorithmic complexity, so they were left out of the simulator.
No client-side dedup. A proposal injected into a leader who immediately loses leadership may be replicated, lost, and never re-proposed. The simulator's cluster_pending queue is drained unconditionally; we are testing the consensus core, not the client RPC layer.
Follower truncation is by replacement, not by prefix-match. When a follower receives NewLeader(e, hist), it adopts hist wholesale, even if its own history shares a prefix. This is correct (the leader's history is authoritative for the new epoch) but heavier than necessary; a real implementation would diff.

Why three languages

Same reasoning as db-16 and db-17, plus one new lesson specific to ZAB: the algorithm has three quorum-tracking sets that are easy to get subtly wrong (epoch_acks for discovery, leader_acks for sync, and the per-zxid ack_set for broadcast). Each set must be iterated in a stable order for the dump, and each must include the leader's own id on initialization. The cross-language test catches both mistakes immediately: forgetting to add self.id to epoch_acks, for example, costs a tick of discovery time that perturbs every downstream delivery and changes the dump bytes.

db-19 — Execution

One-shot: prove the lab works

cd db-19-zab
./scripts/verify.sh        # all unit tests in Rust, Go, C++
./scripts/cross_test.sh    # byte-identical sha256 across all three, six scenarios

A green run of cross_test.sh ends with the literal line:

=== ALL OK ===

Per-language workflows

Rust

cd src/rust
cargo test --release       # ~10 tests
cargo build --release      # produces target/release/zabctl
./target/release/zabctl --seed 42 --nodes 3 --rounds 1000 --proposals 5

Go

cd src/go
go test ./...              # ~9 tests
go build -o /tmp/zabctl_go ./cmd/zabctl
/tmp/zabctl_go --seed 42 --nodes 3 --rounds 1000 --proposals 5

C++

cd src/cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
ctest --test-dir build     # test_db19 — 10 assertions
./build/zabctl --seed 42 --nodes 3 --rounds 1000 --proposals 5

CLI

All three binaries accept the same flags and print lowercase hex sha256 of the canonical dump to stdout with no trailing newline:

flag	default	meaning
`--seed N`	0 (Go) / 42 (Rust) / 0 (C++)	splitmix64 seed mixed into election timers and message delays
`--nodes K`	3	number of ZAB nodes (1 is legal; majority is then 1)
`--rounds R`	0/1000	number of simulator ticks to run
`--proposals P`	0	number of client commands to inject during the run
`--partition s,d,...`	none	comma-separated pairs `(src, dst)` to drop in that direction

(Flag defaults differ between languages because the cross-test script always passes every flag explicitly. Only behavior under explicit flags is part of the cross-language contract.)

Canonical scenarios

scripts/cross_test.sh runs six scenarios; their sha256s are listed in docs/observation.md. If any change, cross_test.sh will exit non-zero.

label	args
A	`--seed 42 --nodes 3 --rounds 1000 --proposals 5`
B	`--seed 7 --nodes 5 --rounds 2000 --proposals 20`
C	`--seed 99 --nodes 3 --rounds 500 --proposals 0`
D	`--seed 1 --nodes 1 --rounds 200 --proposals 5`
E	`--seed 42 --nodes 3 --rounds 1000 --proposals 3 --partition 0,1,0,2,1,0,2,0`
F	`--seed 3 --nodes 5 --rounds 1500 --proposals 10 --partition 0,1`

D exercises the single-node-leader code path that motivated the propose() → try_commit() call. E isolates node 0 completely; the other two must elect a leader and commit the remaining proposals (the surviving quorum's history is what ends up in the dump entries for nodes 1 and 2). F is an asymmetric partition that causes epoch churn but still lets replication recover.

Sanity checks

# Pick any scenario and round-trip — the hash is content-defined.
./src/rust/target/release/zabctl --seed 42 --nodes 3 --rounds 1000 --proposals 5
# expect: 16af5aa6dbd5ce09b259755f3339d6cf23966ce115b0e30d9c2990487783047d

# Magic of the canonical dump (use the lib directly; the CLI hashes it):
#   - Rust:  TestDumpDeterministicAcrossRuns asserts da.starts_with("DSEZAB01").
#   - Go:    TestDumpDeterministicAndMagic   asserts the same.
#   - C++:   test_dump_deterministic_and_magic in tests/test_db19.cc.

Tunables (CONCEPTS.md cross-reference)

HEARTBEAT_INTERVAL = 50 — leader re-broadcasts last Commit every 50 ticks.
ELECTION_TIMEOUT_MIN = 150, ELECTION_TIMEOUT_SPAN = 150 — base + jitter for follower election deadline.
DELIVERY_DELAY_SPAN = 3 — message delivery delay is 1 + splitmix64(seed ^ src ^ dst ^ t) % 3 ticks.

Changing any of these changes every canonical hash. The intent is that the lab is a fixed-point study object: the values are part of the contract.

db-19 — Observation

What the cross-language test produces and how to read it by hand.

Expected sha256s

scripts/cross_test.sh runs six scenarios and asserts the three binaries (Rust, Go, C++) all print the same hex digest. The current canonical hashes are:

label	args	sha256
A	`--seed 42 --nodes 3 --rounds 1000 --proposals 5`	`16af5aa6dbd5ce09b259755f3339d6cf23966ce115b0e30d9c2990487783047d`
B	`--seed 7 --nodes 5 --rounds 2000 --proposals 20`	`b60388e978a9b98792edb00c8d33217da8bff9945a89d2c0c18b5f69520b91cf`
C	`--seed 99 --nodes 3 --rounds 500 --proposals 0`	`8aef7604639fe0f2b349b38d74e10b6da8ac252b626976563bba69c722426296`
D	`--seed 1 --nodes 1 --rounds 200 --proposals 5`	`d4dbb92f91f9a0adf0c4c0b91fa46b2a5145907450897cd6473a02a6279604fd`
E	`--seed 42 --nodes 3 --rounds 1000 --proposals 3 --partition 0,1,0,2,1,0,2,0`	`5e4dbddb605e469c99fb682c00256445dcb2ed07e984f673d4296ef19719979a`
F	`--seed 3 --nodes 5 --rounds 1500 --proposals 10 --partition 0,1`	`c9df583bd714534c488aac710e6cc6e57e4b21d2fe96ec17068bd1c7525bc1b3`

If any of these change, cross_test.sh will fail. Either you have a bug, or you have intentionally changed the spec (timer constants, schedule formula, dump layout) and you must update this table in the same commit.

What the canonical dump looks like (scenario D — single node)

--seed 1 --nodes 1 --rounds 200 --proposals 5. Five proposals into a single-node cluster — the leader is itself the majority, so every proposal commits immediately and discovery/sync collapse to a no-op (quorum reached on self.id).

offset 0x00 :  44 53 45 5A 41 42 30 31    "DSEZAB01"        magic
offset 0x08 :  01 00 00 00                 1                 node_count
offset 0x0c :  00 00 00 00                 0                 node id
offset 0x10 :  02                          role = Leading (2)
offset 0x11 :  XX XX XX XX                 current_epoch     (== 1 if no churn)
offset 0x15 :  XX XX XX XX                 accepted_epoch    (== current_epoch)
offset 0x19 :  XX XX XX XX                 last_zxid.epoch   (== current_epoch)
offset 0x1d :  05 00 00 00                 last_zxid.counter = 5
offset 0x21 :  XX XX XX XX                 last_committed.epoch
offset 0x25 :  05 00 00 00                 last_committed.counter = 5
offset 0x29 :  05 00 00 00                 history_len = 5
offset 0x2d :  XX XX XX XX                 history[0].zxid.epoch
offset 0x31 :  01 00 00 00                 history[0].zxid.counter
offset 0x35 :  05 00 00 00                 history[0].payload_len = 5
offset 0x39 :  7A 61 62 2D 30              "zab-0"           payload
...

Each history entry, including the first, is 4 + 4 + 4 + 5 = 17 bytes (epoch + counter + len + "zab-N"). Total dump for D is therefore 0x2d + 5 * 17 = 0x82 = 130 bytes. Exact bytes depend on whatever epoch the leader has bumped through by the time the run ends; the single-node case is nearly always current_epoch = 1.

A multi-node dump (scenario C — quiet cluster)

--seed 99 --nodes 3 --rounds 500 --proposals 0. No proposals; the cluster elects a leader, runs through discovery + sync, then heartbeats for the rest of the run. Every node's history is empty:

44 53 45 5A 41 42 30 31         magic
03 00 00 00                     node_count = 3

00 00 00 00                     node id 0
XX                              role            (Following or Leading)
XX XX XX XX                     current_epoch   (1 if first election succeeded clean)
XX XX XX XX                     accepted_epoch
00 00 00 00 00 00 00 00         last_zxid       (0, 0)
00 00 00 00 00 00 00 00         last_committed  (0, 0)
00 00 00 00                     history_len = 0

01 00 00 00                     node id 1
... same shape ...

02 00 00 00                     node id 2
... same shape ...

Total dump: 8 + 4 + 3 * (4 + 1 + 4 + 4 + 4 + 4 + 4 + 4 + 4) = 111 bytes. (33 bytes per node with empty history.)

How to debug a divergence

If cross_test.sh fails, write the raw dumps to disk (the CLI prints only the hash; you'll need a one-liner that calls canonical_dump directly, or modify zabctl.rs / main.go / zabctl.cc to dump the raw bytes instead of the hash). Then:

cmp -l /tmp/zab_A_rust.bin /tmp/zab_A_go.bin | head
xxd /tmp/zab_A_rust.bin | sed -n '<line>,+2p'
xxd /tmp/zab_A_go.bin   | sed -n '<line>,+2p'

The first divergence offset tells you what to look at:

offset range	likely culprit
0x00–0x07	magic (typo: `DSEZAB01` not `DSEZAB1` or `DSEZAB02`)
0x08–0x0b	node_count (impossible if all three accept `--nodes` correctly)
inside a node block, on `role`	enum mapping (Looking=0, Following=1, Leading=2)
inside a node block, on `current_epoch` / `accepted_epoch`	discovery handshake bug; the leader's `pending_new_epoch` likely didn't `max()` against `current_epoch`
inside a node block, on `last_zxid`	counter reset on epoch change wrong (must reset to 0; first new proposal has counter 1)
inside a node block, on `last_committed`	`try_commit` quorum count wrong, or `propose()` not calling `try_commit` (n=1 case)
inside `history_len`	follower `Propose` filter wrong (out-of-order zxid not dropped), or `NewLeader` replacement not adopting leader's history
inside a history entry	broadcast loop iteration order — must be ascending peer id

In all six existing scenarios these checks pass; the table above is the runbook for the day someone changes the algorithm and forgets to update one of the three implementations.
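The runbook above maps a byte offset to a field by hand. The same walk can be automated; the sketch below is a hypothetical helper (first_divergence, field_at, and sample_dump are names invented here, not part of the lab) that follows the canonical dump layout. It assumes a well-formed dump and will panic on a truncated one.

```rust
/// Index of the first differing byte, or the shorter length if one is a prefix.
fn first_divergence(a: &[u8], b: &[u8]) -> Option<usize> {
    a.iter()
        .zip(b.iter())
        .position(|(x, y)| x != y)
        .or_else(|| if a.len() != b.len() { Some(a.len().min(b.len())) } else { None })
}

fn u32_at(d: &[u8], off: usize) -> u32 {
    u32::from_le_bytes([d[off], d[off + 1], d[off + 2], d[off + 3]])
}

// Fixed-size part of a node block, in dump order: (field name, size in bytes).
const FIXED: [(&str, usize); 9] = [
    ("id", 4), ("role", 1), ("current_epoch", 4), ("accepted_epoch", 4),
    ("last_zxid.epoch", 4), ("last_zxid.counter", 4),
    ("last_committed.epoch", 4), ("last_committed.counter", 4), ("history_len", 4),
];

/// Names the dump field that contains byte offset `target`.
fn field_at(dump: &[u8], target: usize) -> String {
    if target < 8 { return "magic".to_string(); }
    if target < 12 { return "node_count".to_string(); }
    let node_count = u32_at(dump, 8);
    let mut off = 12;
    for _ in 0..node_count {
        let id = u32_at(dump, off);
        for &(name, size) in FIXED.iter() {
            if target < off + size { return format!("node {} {}", id, name); }
            off += size;
        }
        let history_len = u32_at(dump, off - 4); // last fixed field
        for i in 0..history_len {
            let payload_len = u32_at(dump, off + 8) as usize;
            let end = off + 12 + payload_len; // epoch + counter + len + payload
            if target < end { return format!("node {} history[{}]", id, i); }
            off = end;
        }
    }
    "past end of dump".to_string()
}

/// Hypothetical fixture: one node, empty history, every integer zero.
fn sample_dump(role: u8) -> Vec<u8> {
    let mut d = b"DSEZAB01".to_vec();
    d.extend_from_slice(&1u32.to_le_bytes()); // node_count
    d.extend_from_slice(&0u32.to_le_bytes()); // node id
    d.push(role);
    for _ in 0..7 { d.extend_from_slice(&0u32.to_le_bytes()); }
    d
}

fn main() {
    let (a, b) = (sample_dump(2), sample_dump(1));
    if let Some(off) = first_divergence(&a, &b) {
        println!("first divergence at {:#x}: {}", off, field_at(&a, off));
    }
}
```

Feeding it the two raw dumps from a failing scenario gives the table row to start from without counting bytes in xxd output.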

Tick-level scope (Rust REPL trick)

To watch a scenario from the inside, add this temporary print at the top of the per-tick loop body in Cluster::run (it reads the loop variable t):

if std::env::var("ZAB_TRACE").is_ok() {
    eprintln!(
        "t={} roles={:?} epochs={:?} commits={:?}",
        t,
        self.nodes.iter().map(|n| n.role).collect::<Vec<_>>(),
        self.nodes.iter().map(|n| n.current_epoch).collect::<Vec<_>>(),
        self.nodes.iter().map(|n| n.last_committed.counter).collect::<Vec<_>>(),
    );
}

then run ZAB_TRACE=1 zabctl --seed 42 --nodes 3 --rounds 1000 --proposals 5 2>&1 | head -50. The trace goes to stderr, hence the 2>&1 so that head can truncate it; the canonical dump's sha256 is still written to stdout unchanged. Remove the print before commit.

Reading the hashes themselves

The hashes are arbitrary — they are SHA-256 of a binary blob whose bytes encode every node's state at the end of the run. There is no way to look at 16af5aa6... and infer anything about the cluster. What matters is that the same input produces the same output in three languages and that the table above doesn't drift unintentionally.

For human-readable insight, dump canonical_dump(&c) to a file and run xxd over it, or print individual node states in a test rather than at the CLI surface.

db-19 — Verification

Prerequisites

Rust ≥ 1.74 with cargo on PATH.
Go ≥ 1.22 (module declares go 1.22).
CMake ≥ 3.20 and a C++17 compiler (Apple clang ≥ 14, gcc ≥ 11).
An external sha256sum tool is not required: each binary computes its own sha256 in-process.

One command

cd db-19-zab
bash scripts/verify.sh && bash scripts/cross_test.sh

Green is === db-19 :: ALL UNIT TESTS GREEN === followed by === ALL OK ===. Anything else is a regression.

What `verify.sh` does

Rust — cargo test --release --quiet over src/rust/. Builds db19 lib + zabctl binary; runs the inline tests in src/rust/src/lib.rs. Expected: test result: ok.
Go — go test ./... over src/go/. Builds the db19 package + cmd/zabctl; runs src/go/zab_test.go. Expected: PASS and ok github.com/10xdev/dse/db19.
C++ — cmake -DCMAKE_BUILD_TYPE=Release -B build, cmake --build build --target test_db19 zabctl, then ctest --test-dir build --output-on-failure. The test binary ends with Test #1: test_db19 ........ Passed.

If any of these three blocks fails, the script exits non-zero and the rest does not run.

What `cross_test.sh` does

For each of the six canonical scenarios (A–F) it invokes the three release zabctl binaries with identical flags, captures the lowercase-hex sha256 of the canonical cluster dump, and asserts rust == go == cpp byte-for-byte. The scenarios are:

label	args	what it exercises
A	`--seed 42 --nodes 3 --rounds 1000 --proposals 5`	basic 3-node, 5 proposals, clean network
B	`--seed 7 --nodes 5 --rounds 2000 --proposals 20`	bigger cluster, longer horizon
C	`--seed 99 --nodes 3 --rounds 500 --proposals 0`	election convergence only
D	`--seed 1 --nodes 1 --rounds 200 --proposals 5`	degenerate single node (instant leader)
E	`--seed 42 --nodes 3 --rounds 1000 --proposals 3 --partition 0,1,0,2,1,0,2,0`	3-node with churn
F	`--seed 3 --nodes 5 --rounds 1500 --proposals 10 --partition 0,1`	5-node, asymmetric one-way drop

Canonical hashes are listed in docs/observation.md. The script asserts consistency among the three ports; it is docs/observation.md that pins them to the historical fingerprint.

What green guarantees

Determinism. Same flags ⇒ same canonical dump bytes across languages and runs (modulo endianness — all targets are little-endian). The simulator advances in integer ticks; all map/set iteration is over BTreeMap / sorted Go slices / std::map so the dump order is fixed.
Safety in the modeled environment. No two nodes commit different histories. For every scenario in the suite, after the final tick:
- last_committed.epoch is monotonic per node.
- Where two nodes' history overlap by zxid, the bytes match.
- No follower has committed past last_committed reported by the leader of its current epoch.
Liveness in the modeled environment. Scenarios A, B, D, F include proposals and run long enough to elect a leader and commit them. Scenarios C and E confirm we don't commit what we shouldn't (C has zero proposals; E partitions away the would-be leader so the alternative path must take over).

What green does not guarantee

Behavior outside the canonical scenarios. ZAB's state space is large; six fingerprints are an acceptance test, not a model checker. Real validation needs TLA+ (see references.md).
Performance. No latency or throughput is measured. Tick count is simulation cost, not wall-clock SLA.
Snapshotting / log compaction. Histories grow unboundedly; ZooKeeper truncates via snapshots, which is out of scope here.
Production safety primitives — fsync barriers, on-disk checksums, recovery from torn writes, byzantine actors. All deliberately deferred.
Real network. Partitions are modeled as a BTreeSet of one-way drops applied at delivery; reordering happens through the simulator's priority queue, not a lossy, out-of-order network. There is no actual socket.
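The modeled network in the last item can be sketched with the standard library alone. The heap key (delivery_time, sender, seq) and the one-way drop set come from the text; the Net struct, its method names, and carrying only the destination as the message body are simplifications assumed for this example.

```rust
use std::cmp::Reverse;
use std::collections::{BTreeSet, BinaryHeap};

struct Net {
    // Min-heap on (delivery_time, sender, seq); dst rides along as the message.
    heap: BinaryHeap<Reverse<(u64, u32, u64, u32)>>,
    seq: u64, // global, monotonically increasing
    dropped: BTreeSet<(u32, u32)>, // one-way (src, dst) pairs
}

impl Net {
    fn new(dropped: BTreeSet<(u32, u32)>) -> Net {
        Net { heap: BinaryHeap::new(), seq: 0, dropped }
    }

    fn send(&mut self, deliver_at: u64, src: u32, dst: u32) {
        self.heap.push(Reverse((deliver_at, src, self.seq, dst)));
        self.seq += 1;
    }

    /// Pops every message due at or before tick `t`, in heap order.
    /// Partitioned messages are discarded here, at delivery time.
    fn deliver_due(&mut self, t: u64) -> Vec<(u32, u32)> {
        let mut out = Vec::new();
        while let Some(Reverse((at, src, _seq, dst))) = self.heap.peek().copied() {
            if at > t {
                break;
            }
            self.heap.pop();
            if !self.dropped.contains(&(src, dst)) {
                out.push((src, dst));
            }
        }
        out
    }
}

fn main() {
    let mut net = Net::new(BTreeSet::from([(0, 1)])); // drop 0 -> 1 only
    net.send(5, 2, 0);
    net.send(5, 1, 0);
    net.send(3, 0, 1); // will be dropped
    net.send(3, 1, 0);
    for t in 0..=5 {
        for (src, dst) in net.deliver_due(t) {
            println!("t={} deliver {} -> {}", t, src, dst);
        }
    }
}
```

Because seq is unique, the heap key is a total order, so two messages can never tie and delivery order cannot depend on the heap implementation.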

Invariant assertions in code

The implementations carry inline assertions where they are nearly free. The load-bearing ones:

Where	Assertion	What it catches
Leader `on_ack`	refuse acks for zxids not in our outstanding set	duplicate / replayed acks inflating quorum
`update_vote` (election)	only adopt votes with greater `(last_zxid, id)`	non-monotone vote drift
`handle_new_epoch`	followers must reply only if `new_epoch >= accepted_epoch` (an equal epoch is re-acked idempotently)	accepting a stale epoch from a deposed leader
`handle_new_leader`	followers replace history only if `new_epoch > current_epoch`	losing already-committed entries
`canonical_dump` writer (all 3 langs)	nodes in ascending id, per-node history in ascending zxid	dump-writer drift between languages

The unit tests assert each of these on at least one path.

db-19 — Broader Ideas

The lab implements textbook ZAB (epoch + counter, leader-driven broadcast, discovery + sync on leadership change) with a deterministic simulator and three-language cross-validation. It deliberately stops where production engineering begins. This document collects the threads worth pulling on next.

ZAB-with-snapshots

Production ZooKeeper periodically truncates history by snapshotting the in-memory state machine and dropping txns whose zxid is below the snapshot's. Followers that fall behind the leader's snapshot are fast-forwarded with SNAP (whole-state copy) rather than DIFF (replay tail). Worth implementing as db-19b — it reuses the wire format and adds a Snap { zxid, state_bytes } RPC alongside the existing NewLeader payload.

Fast Leader Election (production form)

Real ZooKeeper's FLE has tie-breaking by peer epoch (the highest epoch this voter has ever seen) before falling back to (last_zxid, id). The lab uses just (last_zxid, id) which is enough for safety but loses an optimization: a node that just lost leadership often still has the highest peer-epoch and should regain leadership quickly. Worth a db-19c.
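The difference between the two orderings fits in a short sketch. The Vote struct, lab_key, fle_key, and pick are hypothetical names for this example; the comparison orders themselves are the ones described above, and the production one is simplified to its three-field core.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Zxid { epoch: u32, counter: u32 } // derived Ord: epoch first, then counter

#[derive(Clone, Copy, Debug)]
struct Vote { peer_epoch: u32, last_zxid: Zxid, id: u32 }

/// The lab's ordering: (last_zxid, id).
fn lab_key(v: &Vote) -> (Zxid, u32) {
    (v.last_zxid, v.id)
}

/// Production-style ordering: peer epoch first, then (last_zxid, id).
fn fle_key(v: &Vote) -> (u32, Zxid, u32) {
    (v.peer_epoch, v.last_zxid, v.id)
}

/// Id of the vote that is greatest under `key`. Panics on an empty slice.
fn pick<K: Ord>(votes: &[Vote], key: impl Fn(&Vote) -> K) -> u32 {
    votes.iter().max_by_key(|v| key(*v)).unwrap().id
}

fn main() {
    let z = Zxid { epoch: 1, counter: 5 };
    // Node 0 has seen epoch 2 (it just lost leadership); histories are equal.
    let votes = [
        Vote { peer_epoch: 2, last_zxid: z, id: 0 },
        Vote { peer_epoch: 1, last_zxid: z, id: 1 },
        Vote { peer_epoch: 1, last_zxid: z, id: 2 },
    ];
    println!("lab ordering picks node {}", pick(&votes, lab_key));
    println!("peer-epoch ordering picks node {}", pick(&votes, fle_key));
}
```

With equal histories the lab ordering falls through to the id tie-break, while the peer-epoch ordering prefers the node that has seen the newest epoch.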

Observer mode

Observers receive Commit but never vote in elections or quorums. ZooKeeper added them at scale to push read traffic past the voter-set throughput ceiling without inflating quorum sizes. The simulator extension is small: add a Role::Observer, exclude it from quorum counts, still deliver every Commit.
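A rough sketch of that extension, assuming a hypothetical Role::Observer variant and helper names (quorum, has_quorum) that do not exist in the lab today. It only shows the counting rule: observers are left out of both the quorum size and the ack tally.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Role { Looking, Following, Leading, Observer }

/// Majority of the voting members only; observers do not enlarge the quorum.
fn quorum(roles: &[Role]) -> usize {
    let voters = roles.iter().filter(|r| **r != Role::Observer).count();
    voters / 2 + 1
}

/// `ackers` holds distinct node indices that acknowledged a proposal.
/// Acks from observers are ignored.
fn has_quorum(roles: &[Role], ackers: &[usize]) -> bool {
    let voting_acks = ackers.iter().filter(|&&i| roles[i] != Role::Observer).count();
    voting_acks >= quorum(roles)
}

fn main() {
    // Three voters plus two observers: quorum stays at 2, not 3.
    let roles = [Role::Leading, Role::Following, Role::Looking, Role::Observer, Role::Observer];
    println!("quorum = {}", quorum(&roles));
    println!("leader + 2 observers: {}", has_quorum(&roles, &[0, 3, 4]));
    println!("leader + 1 follower:  {}", has_quorum(&roles, &[0, 1]));
}
```

How an observer would appear in the canonical dump (a fourth role byte, or omission) is a separate decision that this sketch does not address.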

Read-only mode (RO clients during partition)

When a quorum dies but some nodes remain, ZooKeeper exposes those survivors in a read-only mode that serves last-known committed state. This is a useful failure-mode case for the simulator: drop into RO when no quorum responds within an election cycle.

Cross-epoch zxid ordering

Production ZAB stuffs (epoch, counter) into one u64 (32 bits each). The lab uses a struct for clarity; switching to the packed form is a one-line change and would let zxid live in atomic operations on real hardware. Worth a benchmark in db-22.
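A small sketch of the packed form, assuming the epoch goes in the high 32 bits and the counter in the low 32 bits. The function names `pack_zxid` and `unpack_zxid` are made up for this example.

```rust
/// Pack (epoch, counter) into one u64: epoch in the high 32 bits,
/// counter in the low 32 bits.
pub fn pack_zxid(epoch: u32, counter: u32) -> u64 {
    ((epoch as u64) << 32) | (counter as u64)
}

/// Inverse of `pack_zxid`: returns (epoch, counter).
pub fn unpack_zxid(z: u64) -> (u32, u32) {
    ((z >> 32) as u32, z as u32)
}
```

Because the epoch occupies the high bits, plain `u64` comparison on packed values agrees with the lexicographic (epoch, counter) ordering of the struct, which is what makes the packed form usable in atomics.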

Production systems to study

Apache ZooKeeper

The canonical implementation. Read the original ZAB paper (Junqueira, Reed, Serafini — Zab: High-performance broadcast for primary-backup systems, DSN 2011) alongside the source in org.apache.zookeeper.server.quorum. The simulator in this lab maps directly onto Leader.java, Follower.java, and FastLeaderElection.java.

Kafka KRaft (Raft replacement for ZooKeeper)

Confluent's argument against keeping ZooKeeper as a dependency was operational: two consensus systems (ZAB for metadata, Kafka's own ISR for log replication) doubled the failure-modes and runbooks. KIP-500 replaces ZAB with a Raft-style log inside Kafka itself. A good real-world counterpoint to read alongside db-17 (Raft).

Curator / Recipes

Apache Curator's "recipes" (locks, leader latches, distributed queues) are layered on top of ZooKeeper. They are a clinic in how not to misuse a primary-order primitive: every recipe pins its watch semantics + retry policy explicitly because ZK ephemeral nodes are not ACID transactions.

Etcd v2 vs v3

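etcd has been Raft-based since its first releases, so it never ran a ZAB-like broadcast: early versions used the goraft library, and the 2.0 line switched to etcd's own raft package. The v2-to-v3 change was in the data model and API (flat MVCC keyspace, gRPC) rather than in the consensus protocol. Comparing the early Raft code (gone, but in git history) with the later raft/ package is instructive: same problem, same algorithm, very different state-machine structure.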

Chubby (Google)

Chubby is Multi-Paxos-based, not ZAB, but the lease + session model in ZooKeeper traces directly back to Chubby. Burrows's OSDI 2006 paper is the canonical writeup; read it after this lab and before db-23.

Performance experiments worth running

The simulator's ticks are a unit of cost for comparative experiments:

Quorum-size sweep. For nodes ∈ {3, 5, 7, 9}, run proposals = 50 and count ticks to commit the last proposal. Expected: commit cost rises slowly with quorum size (the leader waits for more acks per proposal), while election cost rises faster (every Looking node broadcasts its vote to every peer each tick, so vote traffic grows roughly quadratically with node count).
Discovery+sync cost on leadership churn. Vary the partition schedule's --partition density. The lab's E scenario has 4 churn events in 1000 ticks; the more churn, the higher the ratio of NewEpoch/NewLeader bytes to Propose/Commit bytes in the dump. Plot that ratio.
Comparison to Raft (db-17) and Paxos (db-18). Same flag surface (--seed --nodes --rounds --proposals --partition) and same scenarios — lab structure is identical on purpose. Compare scenario-A commit latency across the three protocols.

What "production-quality" would require beyond this lab

Durable storage. history, current_epoch, accepted_epoch must survive kill -9 and power loss. Real ZooKeeper writes a WAL (see db-03) and snapshots every N transactions.
Real network. Sockets, TCP retransmits, framing, TLS, auth. The simulator's OutMsg collapses all of that.
Client sessions. ZooKeeper's session-id ↔ ephemeral-node binding is a major protocol surface in its own right; not modeled here.
Watches. The pub/sub layer on top of read-paths. Adds a fan-out table and a per-session notification queue.
Cluster reconfiguration. Adding/removing voters safely is its own protocol (joint quorum on the membership txn). Out of scope.
Recovery from torn writes. Per-page checksums on the WAL.
Adversarial inputs. ZAB assumes crash-stop failures only. Tolerating Byzantine faults needs a different protocol family (BFT-SMaRt is one example of a BFT replication library) and a separate code base entirely.

db-19 step 01 — Epoch, zxid, and Fast Leader Election

Goal

Build the persistent state every ZAB node carries and the election protocol that picks the next leader when no one is currently broadcasting. Election must converge in bounded ticks for any quorum-available network, and the chosen leader must always be the node with the highest committed zxid in the surviving quorum.

Tasks

ZxId { epoch: u32, counter: u32 } with lexicographic ordering (epoch first, then counter). Provide a ZxId::ZERO constant. Every zxid comparison in the rest of the protocol uses this ordering. The lab keeps zxid as a struct for clarity, so always compare through the struct's ordering rather than through an ad hoc u64 packing.
Persistent node state. A ZabNode carries:
- id: u32, n_nodes: u32, quorum = n_nodes/2 + 1.
- role: Role (Looking / Following / Leading).
- current_epoch: u32 — the epoch of the leader we last followed. Bumped on NewLeader.
- accepted_epoch: u32 — the epoch we promised on NewEpoch. Always >= current_epoch.
- history: Vec<Txn> — committed and uncommitted txns in zxid order.
- last_committed: ZxId — high-water mark; entries <= this have been applied.
Of these, current_epoch, accepted_epoch, history, and last_committed are the four values that would survive a crash in a real implementation. Everything else (vote tables, ack tables) is transient and rebuilt on the next election.
Rpc::LookForLeader / Vote. A Looking node broadcasts its current vote each tick. On receiving a peer's vote, update via update_vote(peer.last_zxid, peer.id):
- Adopt peer as our vote target if (peer.last_zxid, peer.id) > (current_vote.last_zxid, current_vote.id).
- Record the peer's choice in vote_view[voter_id] = leader_id.
Quorum detection. Walk vote_view and count entries whose value equals each candidate id. The first candidate (in id order) whose count >= quorum becomes the elected leader. If that leader is us, transition to Leading; otherwise transition to Following with leader_id = Some(...).
Election timeout. Track election_deadline per node. If now > election_deadline and we're still Looking, reset the vote table and broadcast a fresh LookForLeader from our current (last_zxid, id). Reseed the deadline with ELECTION_TIMEOUT_MIN + splitmix64(seed) % ELECTION_TIMEOUT_SPAN.
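The ordering, vote-adoption, and quorum-detection tasks above can be sketched together. This is an illustration under assumptions, not the lab's reference code: the `Vote` and `Election` types, the `update_vote` signature, and the `elected` helper are hypothetical, and the sketch assumes a vote carries the candidate's id together with that candidate's last zxid.

```rust
use std::collections::BTreeMap;

/// Field order matters: the derived `Ord` compares `epoch` first,
/// then `counter`, which is the lexicographic rule.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct ZxId {
    pub epoch: u32,
    pub counter: u32,
}

impl ZxId {
    pub const ZERO: ZxId = ZxId { epoch: 0, counter: 0 };
}

/// A vote names a candidate and that candidate's last zxid (assumed shape).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Vote {
    pub last_zxid: ZxId,
    pub leader_id: u32,
}

/// Hypothetical election state for one Looking node.
pub struct Election {
    pub my_id: u32,
    pub quorum: usize,
    pub current_vote: Vote,
    /// voter id -> leader id that voter currently backs.
    pub vote_view: BTreeMap<u32, u32>,
}

impl Election {
    /// Start by voting for ourselves.
    pub fn new(my_id: u32, my_last_zxid: ZxId, n_nodes: u32) -> Self {
        let mut vote_view = BTreeMap::new();
        vote_view.insert(my_id, my_id);
        Election {
            my_id,
            quorum: (n_nodes / 2 + 1) as usize,
            current_vote: Vote { last_zxid: my_last_zxid, leader_id: my_id },
            vote_view,
        }
    }

    /// Record the peer's choice; adopt it only if it is strictly greater
    /// on (last_zxid, id). Returns true when our own vote changed.
    pub fn update_vote(&mut self, voter_id: u32, peer: Vote) -> bool {
        self.vote_view.insert(voter_id, peer.leader_id);
        let mine = (self.current_vote.last_zxid, self.current_vote.leader_id);
        let adopt = (peer.last_zxid, peer.leader_id) > mine;
        if adopt {
            self.current_vote = peer;
            self.vote_view.insert(self.my_id, peer.leader_id);
        }
        adopt
    }

    /// First candidate (ascending id) backed by at least a quorum of voters.
    pub fn elected(&self) -> Option<u32> {
        let mut counts: BTreeMap<u32, usize> = BTreeMap::new();
        for leader in self.vote_view.values() {
            *counts.entry(*leader).or_insert(0) += 1;
        }
        counts.into_iter().find(|&(_, n)| n >= self.quorum).map(|(id, _)| id)
    }
}
```

Using `BTreeMap` for both the vote view and the tally keeps iteration order deterministic, which matters later for cross-language byte identity.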

Acceptance

Inline unit tests in each language. Names below are the Rust form; Go uses TestZxIdOrdering style, C++ uses test_zxid_ordering:

zxid_ordering_is_lexicographic — ZxId{0,9} < ZxId{1,0}, ZxId{1,0} < ZxId{1,1}, ZERO < ZxId{0,1}. Locks the comparator.
vote_adopts_higher_last_zxid: node 0 with last_zxid=(1,5) votes for itself; receives a vote from node 2 with (2,0); adopts node 2. Then receives from node 1 with (2,0) and does not re-adopt (tie on zxid, lower id loses).
quorum_of_votes_elects_highest — in a 3-node cluster all voting for node 2, node 2 transitions to Leading after the second matching Vote arrives.
election_does_not_decide_in_minority — partition isolates node 0 from {1,2}; node 0 must never leave Looking regardless of how many ticks elapse.

All four green in Rust, Go, and C++.

db-19 step 02 — Discovery, sync, and atomic broadcast

Goal

Layer the steady-state ZAB protocol on top of the elected leader from step 01. The leader must bring every follower's history up to its own before accepting new proposals; once synced, the leader assigns a monotone zxid to each payload and commits it on majority ack. The dump bytes must match across Rust, Go, and C++.

Tasks

Discovery (NewEpoch / AckEpoch). The fresh leader picks new_epoch = max(self.accepted_epoch, max-peer-accepted) + 1 and broadcasts NewEpoch { new_epoch }. Each follower:
- Asserts new_epoch > accepted_epoch (refuse stale leaders).
- Sets accepted_epoch = new_epoch.
- Replies AckEpoch { current_epoch, last_zxid } — the follower's own committed epoch + tail of its history.
The leader waits for a quorum of AckEpoch (counting itself). At quorum, it knows the highest zxid that any majority node has committed; that becomes the new initial history.
Sync (NewLeader / AckLeader). Leader broadcasts NewLeader { new_epoch, history } where history is the leader's own log (which must include everything any quorum member has acked, by the contract of step 1). Each follower:
- Asserts new_epoch > current_epoch (refuse historical leaders).
- Replaces history with the leader's payload.
- Sets current_epoch = new_epoch.
- Replies AckLeader { new_epoch }.
On quorum of AckLeader, the leader broadcasts Commit for every zxid in the synced history that has not yet been committed. The cluster is now in steady state.
Broadcast (Propose / Ack / Commit). Each step tick, if there are queued proposals and we are the leader:
- Assign zxid = ZxId { epoch: current_epoch, counter: ++last_counter }.
- Append Txn { zxid, payload } to local history.
- Broadcast Propose { txn } to all followers.
Each follower asserts txn.zxid.epoch == current_epoch and txn.zxid > history.last().zxid, then appends and replies Ack { zxid }. The leader tracks acks per zxid in a BTreeMap<ZxId, BTreeSet<u32>>. On quorum (counting itself), it broadcasts Commit { zxid } and advances last_committed.
Apply on commit. Followers receiving Commit { zxid } advance last_committed = max(last_committed, zxid) and (in a real system) apply the txn to the state machine. The lab leaves the state machine implicit — last_committed and history are the only observable surface.
Canonical dump. dump_cluster(nodes) = magic("DSEZAB01") || u32 node_count || dump_node(0) || dump_node(1) || ... where dump_node = id u32 || role u8 || current_epoch u32 || accepted_epoch u32 || last_zxid (epoch,counter) || last_committed (epoch,counter) || history_len u32 || [zxid, payload_len u32, payload bytes] * history_len. All integers little-endian. hash = lowercase hex sha256 of the full byte string, no trailing newline.
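The leader-side ack bookkeeping from the Broadcast task can be sketched like this. `AckTracker` and its methods are hypothetical names for illustration; the sketch assumes acks for a given zxid arrive after the leader has proposed it and that zxids reach quorum in order, as they would over FIFO channels.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct ZxId {
    pub epoch: u32,
    pub counter: u32,
}

/// Hypothetical leader-side bookkeeping for Propose / Ack / Commit.
pub struct AckTracker {
    leader_id: u32,
    quorum: usize,
    /// zxid -> set of node ids that have acked it (ordered, deterministic).
    acks: BTreeMap<ZxId, BTreeSet<u32>>,
    pub last_committed: ZxId,
}

impl AckTracker {
    pub fn new(leader_id: u32, n_nodes: u32) -> Self {
        AckTracker {
            leader_id,
            quorum: (n_nodes / 2 + 1) as usize,
            acks: BTreeMap::new(),
            last_committed: ZxId { epoch: 0, counter: 0 },
        }
    }

    /// Called when the leader proposes: the leader counts as one ack.
    pub fn on_propose(&mut self, zxid: ZxId) {
        let id = self.leader_id;
        self.acks.entry(zxid).or_default().insert(id);
    }

    /// Record a follower ack. Returns true exactly once per zxid, at the
    /// moment quorum is first reached; late acks return false.
    pub fn on_ack(&mut self, zxid: ZxId, from: u32) -> bool {
        if zxid <= self.last_committed {
            return false; // already committed: ignore the late ack
        }
        let voters = self.acks.entry(zxid).or_default();
        voters.insert(from);
        if voters.len() < self.quorum {
            return false;
        }
        self.acks.remove(&zxid);
        self.last_committed = zxid;
        true
    }
}
```

The caller would broadcast `Commit { zxid }` whenever `on_ack` returns true. Dropping the entry on commit is what keeps a late third ack from committing the same zxid twice.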

Acceptance

Inline unit tests in each language:

discovery_bumps_accepted_epoch — leader elected at accepted_epoch=0 broadcasts NewEpoch{1}; followers reach accepted_epoch=1.
sync_replaces_follower_history_with_leader_history — follower with stale history receives NewLeader { history: leader_log } and ends with history == leader_log.
propose_commits_on_quorum_ack: leader in 3-node cluster proposes one txn; commits after 1 follower ack (leader + 1 = 2 = quorum). The second follower's late ack does not double-commit.
propose_does_not_commit_without_quorum — leader in 5-node cluster proposes, 1 follower acks; last_committed stays at ZxId::ZERO.
zxid_counter_is_monotone_per_epoch — three proposals get counter 1, 2, 3; if the leader's epoch bumps (next election), counter resets to 1 under the new epoch.
canonical_dump_is_byte_stable — same input scenario → same dump → same sha256 across two calls in the same process.

All six green in Rust, Go, and C++.

db-19 step 03 — Cross-language determinism

Goal

Lock the byte-level output of all three implementations (Rust, Go, C++) to the same sha256 for every canonical scenario in scripts/cross_test.sh. This is the difference between "ZAB works in my language" and "ZAB is this exact state machine".

Tasks

Deterministic RNG. splitmix64(u64) -> u64 per the spec:
```
x  += 0x9E3779B97F4A7C15
z   = (x ^ (x >> 30)) * 0xBF58476D1CE4E7B5
z   = (z ^ (z >> 27)) * 0x94D049BB133111EB
out =  z ^ (z >> 31)
```
Every random choice in the simulator (election timeout, delivery delay, partition schedule index) consumes one splitmix64 call on a per-node counter. No language may use its own rand or math/rand or <random> defaults.
Stable iteration. Every map iteration in election, ack tracking, and dump emission is over BTreeMap (Rust), std::map (C++), or a sorted []uint32 (Go). No HashMap / unordered_map / map[uint32] may appear in any code path that affects bytes-on-the-wire or bytes-in-the-dump.
Delivery order. OutMsgs enqueued in the same tick are delivered in FIFO order per destination and in ascending source-id order across destinations. Implement with a BinaryHeap<(deliver_at, src_id, seq_no, msg)> (Rust) and the equivalent in Go (container/heap with the same key) and C++ (std::priority_queue). Rust's BinaryHeap and C++'s std::priority_queue are max-heaps by default, so invert the comparison (std::cmp::Reverse, std::greater) to pop the earliest deliver_at first. The seq_no tie-breaks duplicates within the same tick.
Partition modelling. --partition a,b,c,d,... is a list of (src, dst) one-way drops. Store as a BTreeSet<(u32, u32)>. At delivery, drop the message if (src, dst) ∈ partition_set. Symmetric partitions are expressed as 0,1,1,0. Single-arg list length must be even (no half-drop); reject odd-length input with exit code 2.
zabctl CLI surface. All three binaries accept:
```
zabctl --seed <u64> --nodes <u32> --rounds <u32> --proposals <u32> [--partition a,b,c,d,...]
```
Print the lowercase-hex sha256 of dump_cluster(...) with no trailing newline. Exit code 2 on any bad flag.
Wire-format magic. First 8 bytes of the dump are the ASCII string "DSEZAB01". Bump to "DSEZAB02" if the layout ever changes (and update docs/observation.md in the same commit).
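The delivery-order task can be sketched as a small priority queue. This is one possible shape, not the lab's implementation: `DeliveryQueue`, `Key`, and `Entry` are hypothetical names, and the sketch orders entries by the key alone so the message payload does not need to implement `Ord`.

```rust
use std::cmp::{Ordering, Reverse};
use std::collections::BinaryHeap;

/// Ordering key; the derived `Ord` compares fields top to bottom.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Key {
    deliver_at: u64,
    src_id: u32,
    seq_no: u64,
}

/// Heap entry compared by key only.
struct Entry<M> {
    key: Key,
    msg: M,
}

impl<M> PartialEq for Entry<M> {
    fn eq(&self, other: &Self) -> bool {
        self.key == other.key
    }
}
impl<M> Eq for Entry<M> {}
impl<M> PartialOrd for Entry<M> {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl<M> Ord for Entry<M> {
    fn cmp(&self, other: &Self) -> Ordering {
        self.key.cmp(&other.key)
    }
}

/// Min-queue of in-flight messages. `BinaryHeap` is a max-heap,
/// so entries are wrapped in `Reverse` to pop the smallest key first.
pub struct DeliveryQueue<M> {
    heap: BinaryHeap<Reverse<Entry<M>>>,
    next_seq: u64,
}

impl<M> DeliveryQueue<M> {
    pub fn new() -> Self {
        DeliveryQueue { heap: BinaryHeap::new(), next_seq: 0 }
    }

    /// Enqueue a message; `seq_no` is assigned in enqueue order.
    pub fn push(&mut self, deliver_at: u64, src_id: u32, msg: M) {
        let key = Key { deliver_at, src_id, seq_no: self.next_seq };
        self.next_seq += 1;
        self.heap.push(Reverse(Entry { key, msg }));
    }

    /// Pop the next message due at or before `now`, if any.
    pub fn pop_due(&mut self, now: u64) -> Option<M> {
        let due = self.heap.peek()?.0.key.deliver_at <= now;
        if !due {
            return None;
        }
        self.heap.pop().map(|Reverse(e)| e.msg)
    }
}
```

Because `seq_no` is unique, the key is a total order and the pop sequence is fully determined by the push sequence, which is the property the cross-language hash depends on.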

Acceptance

scripts/cross_test.sh succeeds end-to-end on a clean checkout:

=== ALL OK ===

Each of the six scenarios A–F prints the same hex digest for Rust, Go, and C++. The canonical hashes are pinned in docs/observation.md — if any scenario changes you must update the table in the same commit, with a one-line note on what shifted (timer constant, schedule formula, dump layout).

Optional but valuable: rebuild on a second platform and toolchain (Linux/gcc vs macOS/clang) and confirm the hashes match. All targets in this study are little-endian; the dump assumes that.

db-20 — Distributed KV Store (Concepts)

This lab is the capstone of the distributed-systems track (db-16..19). It stitches consensus and a state machine into the smallest possible replicated key/value store and exposes the result as a deterministic, byte-identical snapshot across Rust, Go, and C++.

Where it sits

Lab	Topic	Provides
db-16	distributed fundamentals	failure models, CAP, FLP
db-17	Raft	a real consensus implementation
db-18	Paxos	a contrast
db-19	ZAB	another contrast
db-20	distributed KV	integration: log + state machine

The scope of db-17 is "implement Raft correctly". The scope of db-20 is "given Raft-shaped semantics, build a replicated state machine you can hash byte-for-byte across three languages." So we deliberately do not re-implement leader election, randomised timers, RPCs, or persistent log files. We model just enough of consensus to study the integration boundary.

Simplifications (vs. real Raft / db-17)

Concept	db-17	db-20
Networking	message-driven	direct in-process broadcast
Elections	randomised timeouts	fixed leader, `current_term == 1`
Followers' acks	RPC reply	function return
Log replication	`next_index` walk on conflict	one-shot snapshot push (`truncate_and_replay`)
Partition	network simulation	`Cluster::partition({ids})` drops messages
Heal / catchup	next-index probes	full log copy + replay
Persistence	log file + fsync	none (in-memory `Vec<LogEntry>`)

The simplifications are honest — they collapse implementation details that do not affect the property we care about: a leader's state-machine snapshot is identical to every healthy follower's snapshot, and identical across languages.

Data model

Op            = NoOp | Put(key, value) | Del(key)
LogEntry      = { term: u64, idx: u64, op: Op }
Replica       = { id, log: Vec<LogEntry>, commit_index, current_term,
                  state_machine: BTreeMap<String, Vec<u8>> }
Cluster       = { replicas[5], leader_idx=0, partitions, next_log_idx }

state_machine is the applied projection of the committed log prefix. We do not store tombstones — Del actually removes the entry.

Propose / commit cycle

The leader allocates the next log index and appends LogEntry{term, idx, op} to its own log.
For each follower, in id order:
- If the follower is partitioned, drop the message.
- Else call try_append(prev_idx, entry). If the follower's last_idx matches prev_idx, the entry is appended (1 ack). If not, snapshot push: truncate_and_replay(leader.log, leader.commit_index) replaces the follower's log wholesale and re-applies the committed prefix (1 ack).
If acks ≥ quorum (3/5), commit on the leader by advancing commit_index and applying to the state machine. Then in id order, advance every reachable follower's commit_index too.
Otherwise the entry stays in the leader's log uncommitted — a future successful proposal or a heal() will retro-commit it.

The total order of commits is the log order: idx 1, 2, 3, ....
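The cycle above can be sketched in Rust as follows. This is a simplified illustration, not the reference implementation: replica 0 is assumed to be the fixed leader, `partitions` holds replica indices, and the catch-up branch copies the leader's log wholesale as a stand-in for `truncate_and_replay` (safe here only because a single leader means every follower log is a prefix of the leader's).

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Clone, Debug, PartialEq)]
pub enum Op {
    NoOp,
    Put(String, Vec<u8>),
    Del(String),
}

#[derive(Clone, Debug, PartialEq)]
pub struct LogEntry {
    pub term: u64,
    pub idx: u64,
    pub op: Op,
}

#[derive(Default)]
pub struct Replica {
    pub log: Vec<LogEntry>,
    pub commit_index: u64,
    pub state_machine: BTreeMap<String, Vec<u8>>,
}

impl Replica {
    pub fn last_idx(&self) -> u64 {
        self.log.last().map_or(0, |e| e.idx)
    }

    /// Apply entries in (commit_index, to]. Entry idx i lives at log[i - 1].
    pub fn advance_commit_to(&mut self, to: u64) {
        let to = to.min(self.last_idx());
        while self.commit_index < to {
            match self.log[self.commit_index as usize].op.clone() {
                Op::NoOp => {}
                Op::Put(k, v) => {
                    self.state_machine.insert(k, v);
                }
                Op::Del(k) => {
                    self.state_machine.remove(&k);
                }
            }
            self.commit_index += 1;
        }
    }
}

pub struct Cluster {
    pub replicas: Vec<Replica>,      // replica 0 is the fixed leader
    pub partitions: BTreeSet<usize>, // indices of unreachable replicas
}

impl Cluster {
    pub fn new(n: usize) -> Self {
        Cluster {
            replicas: (0..n).map(|_| Replica::default()).collect(),
            partitions: BTreeSet::new(),
        }
    }

    /// Returns true if the entry (and any uncommitted tail) committed.
    pub fn propose(&mut self, op: Op) -> bool {
        let quorum = self.replicas.len() / 2 + 1;
        let prev_idx = self.replicas[0].last_idx();
        let entry = LogEntry { term: 1, idx: prev_idx + 1, op };
        self.replicas[0].log.push(entry.clone());
        let leader_log = self.replicas[0].log.clone(); // simple, not fast

        let mut acks = 1; // the leader acks its own entry
        for id in 1..self.replicas.len() {
            if self.partitions.contains(&id) {
                continue; // message dropped
            }
            let follower = &mut self.replicas[id];
            if follower.last_idx() == prev_idx {
                follower.log.push(entry.clone());
            } else {
                follower.log = leader_log.clone(); // catch-up stand-in
            }
            acks += 1;
        }
        if acks < quorum {
            return false; // entry stays in the leader's log, uncommitted
        }
        for id in 0..self.replicas.len() {
            if id == 0 || !self.partitions.contains(&id) {
                self.replicas[id].advance_commit_to(entry.idx);
            }
        }
        true
    }
}
```

Note how a later successful proposal retro-commits an earlier uncommitted tail: `advance_commit_to` applies everything up to the new entry's index, in log order.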

Partition + heal

Cluster::partition({3,4}) adds replica ids 3 and 4 to the partitions set. Subsequent proposals do not message them and do not count their acks. If quorum is still reachable (5 − 2 = 3 ≥ 3), the cluster keeps committing. If not, every proposal returns false and the leader's tail grows uncommitted.

Cluster::heal() clears the set and, for each healed follower in ascending id order, performs truncate_and_replay(leader.log, leader.commit_index). This is db-20's stand-in for Raft's next_index-walk conflict resolution: simpler, deterministic, and good enough for the cross-language exam because the final state is identical.
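A minimal sketch of the catch-up primitive, assuming it rebuilds the follower from scratch: replace the log, clear the state machine, and re-apply the committed prefix. The types mirror the data model above; the exact signatures in the reference implementation may differ.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Debug, PartialEq)]
pub enum Op {
    NoOp,
    Put(String, Vec<u8>),
    Del(String),
}

#[derive(Clone, Debug, PartialEq)]
pub struct LogEntry {
    pub term: u64,
    pub idx: u64,
    pub op: Op,
}

#[derive(Default, Debug)]
pub struct Replica {
    pub log: Vec<LogEntry>,
    pub commit_index: u64,
    pub state_machine: BTreeMap<String, Vec<u8>>,
}

/// Apply one op to a state machine. `Del` removes the key outright.
pub fn apply(sm: &mut BTreeMap<String, Vec<u8>>, op: &Op) {
    match op {
        Op::NoOp => {}
        Op::Put(k, v) => {
            sm.insert(k.clone(), v.clone());
        }
        Op::Del(k) => {
            sm.remove(k);
        }
    }
}

impl Replica {
    /// Replace the log wholesale, then rebuild the state machine by
    /// replaying only the entries the leader has committed.
    pub fn truncate_and_replay(&mut self, leader_log: &[LogEntry], leader_commit: u64) {
        self.log = leader_log.to_vec();
        self.state_machine.clear(); // forgetting this leaves stale keys behind
        self.commit_index = 0;
        for e in self.log.iter().take_while(|e| e.idx <= leader_commit) {
            apply(&mut self.state_machine, &e.op);
            self.commit_index = e.idx;
        }
    }
}
```

Replaying from an empty map costs time proportional to the log length, but it makes the result independent of whatever state the follower held before the heal.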

Cross-language byte identity — the exam

Wire format for one replica's snapshot

magic           "DSEDKV20"           (8 bytes)
u64 LE          commit_index
u64 LE          current_term         (== 1 in this lab)
u32 LE          entry_count
  for each (k, v) in ascending k order:
    u32 LE k_len | k_bytes
    u32 LE v_len | v_bytes

Iteration order is ascending key. Rust uses BTreeMap, C++ uses std::map — both naturally ascending. Go's map iteration is randomised, so the Go implementation does an explicit sort.Strings before serialising.
Deleted keys are not in the dump: Del removes the entry, so there are no tombstones to serialise.
All integers little-endian.
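The wire format above translates almost line for line into a serialiser. The sketch below follows the layout as written; the function name `dump_state` matches the name used elsewhere in this lab, but its signature here is an assumption.

```rust
use std::collections::BTreeMap;

/// Serialise one replica's snapshot in the layout described above.
/// `BTreeMap` iterates in ascending key order, so no explicit sort is needed.
pub fn dump_state(
    commit_index: u64,
    current_term: u64,
    state_machine: &BTreeMap<String, Vec<u8>>,
) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(b"DSEDKV20"); // 8-byte magic
    out.extend_from_slice(&commit_index.to_le_bytes());
    out.extend_from_slice(&current_term.to_le_bytes());
    out.extend_from_slice(&(state_machine.len() as u32).to_le_bytes());
    for (k, v) in state_machine {
        out.extend_from_slice(&(k.len() as u32).to_le_bytes());
        out.extend_from_slice(k.as_bytes());
        out.extend_from_slice(&(v.len() as u32).to_le_bytes());
        out.extend_from_slice(v);
    }
    out
}
```

The fixed header is 8 + 8 + 8 + 4 = 28 bytes, and each entry adds 8 bytes of length prefixes plus the key and value bytes, so a single one-byte key with a one-byte value yields 38 bytes.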

Workload spec

splitmix64 constants: 0x9E3779B97F4A7C15, 0xBF58476D1CE4E7B5, 0x94D049BB133111EB

setup:
  Cluster::new(5)
  if scenario == "partition":
    at op = ops/4   → partition([3, 4])
    at op = ops*3/4 → heal()

for op in 0..ops:
  r1, r2, r3 = rng.next() ×3
  kind = (r1 >> 62) & 0x3                # 0,1,2 → Put; 3 → Del
  k    = "k" + (r2 % keys).to_string()
  v    = u64_le(r3 % 10000)              # 8 bytes

Frozen golden hashes

Scenario	Args	sha256
A	`--seed 42 --ops 500 --keys 32 --scenario default`	`1febc1252f87f873c315526e9d9c78a622131d700dccca84a6e089244930252b`
B	`--seed 7 --ops 2000 --keys 128 --scenario partition`	`272af5b41b729896a7195a6ea72d19111a96a50b29d5d4cdfaac03a058e1a2dc`

These are baked into scripts/cross_test.sh, src/go/dkv20_test.go, and src/cpp/tests/test_dkv20.cc. Any change to PRNG / wire format / op decoding / partition timing / snapshot push will break them — which is exactly the point.

Where to look next

src/rust/src/lib.rs — the reference implementation. Read it first.
src/go/dkv20.go — port. Note the explicit sort.Strings before writing wire bytes.
src/cpp/src/dkv20.cc — port. Note the manual little-endian writers and pure-stdlib sha256.
docs/ — the long-form study notes (analysis, execution, observation, verification, broader ideas).
steps/ — the three-step study plan if you are walking the lab fresh.

db-20 — References

Distributed-system foundations and the specific consensus / replication ideas that informed this lab.

Consensus

Ongaro, D. & Ousterhout, J. In Search of an Understandable Consensus Algorithm (Extended Version). USENIX ATC 2014. https://raft.github.io/raft.pdf
Lamport, L. Paxos Made Simple. ACM SIGACT News, 2001. https://lamport.azurewebsites.net/pubs/paxos-simple.pdf
Howard, H. Distributed consensus revised. PhD thesis, Cambridge 2018. https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-935.pdf

CAP / consistency models

Brewer, E. Towards Robust Distributed Systems (PODC 2000 keynote).
Gilbert, S. & Lynch, N. Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. ACM SIGACT News, 2002.
Vogels, W. Eventually Consistent. CACM 2009.

Transactional storage

Gray, J. & Reuter, A. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, 1993. Chapter 7 on replicated data.
Mohan, C. et al. ARIES. ACM TODS 17(1), 1992 — background on why our log is append-only.

State-machine replication

Schneider, F. B. Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial. ACM Comp. Surveys 22(4), 1990.

Production systems for comparison

etcd — https://etcd.io/docs/v3.5/learning/design-learner/
TiKV — https://tikv.org/docs/dev/reference/architecture/raftstore/
CockroachDB — https://www.cockroachlabs.com/docs/stable/architecture/replication-layer.html

Self-references in this repo

db-16-distributed-fundamentals/ — failure models, CAP/FLP intuitions.
db-17-raft/ — the underlying consensus algorithm.
db-09-leveldb-complete/ — the storage-engine quality bar this lab matches.

db-20 — Analysis

What is the question?

Given Raft-shaped consensus semantics, can we build a replicated state machine that produces a byte-identical snapshot across three language ecosystems? "Byte-identical" is the strongest possible test of cross-language conformance — strings, integers, map iteration order, and op semantics all have to line up.

Why is this an interesting study?

Raft on its own (db-17) tells you nothing about how a real key/value store is layered on top of it. Production systems (etcd, TiKV, CockroachDB) all answer the same questions:

1. What does the leader send to followers? (log entries)
2. When does a follower apply an entry? (when its commit_index advances)
3. How does a partitioned follower catch up? (next-index probe / install snapshot)
4. What invariants does the state machine maintain across replicas?

db-20 strips out the network and timer noise so we can focus on questions 2–4 alone. The simplification turns out to be the whole point: once you stop worrying about elections, the integration story fits in ~400 lines of Rust.

Design choices and trade-offs

Snapshot push instead of next-index walk

Raft's real conflict resolution is "decrement next_index, retry". For our purposes that produces the same final state as a one-shot snapshot push, but it forces us to model RPC round-trips. We pick the snapshot push because:

it converges in a single step (deterministic), and
it makes heal() trivial to write — just truncate and replay.

The cost: we cannot study log-divergence scenarios where two leaders both append. That's fine: this lab is single-leader by construction.

State machine is `BTreeMap<String, Vec<u8>>`

A sorted map gives free deterministic iteration in Rust and C++. Go's map has randomised iteration, so the Go implementation explicitly sorts before serialising. This is the single most common source of non-determinism in cross-language ports — every wire-format-aware function in the Go code does sort.Slice or sort.Strings.

Op encoding inside `LogEntry` is not wire-stable

The log is in-memory only; we never serialise LogEntry itself. Cross-language byte identity is only required at the snapshot boundary. This separation of "internal" and "wire" structures is cheap discipline that scales to real systems.

`current_term` is in the snapshot but is always 1

We expose current_term in the wire format anyway, plumbed through to all three implementations. This makes it cheap to add elections later (e.g. as an extension exercise) without having to bump the magic.

Failure-mode catalogue (what we covered, what we did not)

Failure	db-17 covers?	db-20 covers?
Single follower crash + catchup	yes	yes (heal)
Network partition isolating minority	yes	yes
Leader crash + new election	yes	no (fixed leader)
Split-brain after partition heal	yes	no (no elections)
Log compaction / snapshot install	scratched the surface	no
Disk-loss / log truncation	no	no
Byzantine behaviour	no	no

Where to take this next

broader-ideas.md lists the explicit extensions: linearizable reads, log compaction, multi-region replicas, learner replicas, snapshot install over the wire, gossip-style cluster membership.
The exam in cross_test.sh doubles as a regression net for any of those extensions — break the snapshot bytes, you break the build.

db-20 — Execution Plan

Stage 1 — Single replica, no replication

Implement Replica and apply() for Op::{NoOp, Put, Del} in Rust. Verify that a Cluster::new(1) (1-replica cluster — trivially has its own quorum) can propose(Put("a", b"v")) and the state machine sees a → b"v". Test cases 1 and 3 in the Rust suite.

Stage 2 — Five replicas, no failures

Add Cluster, propose, quorum = N/2 + 1. Verify that a single propose on a 5-replica cluster applies to all five state machines because all 5 follow the leader. Tests 2 and 6.

Stage 3 — Partitions

Add Cluster::partition(ids) and is_partitioned. Drop messages to and from partitioned replicas. Test that:

3/5 reachable still commits (test 4),
2/5 reachable does not commit (test 3),
partitioned followers have commit_index == 0 after one proposal (test 4).

Stage 4 — Heal + catchup

Implement Cluster::heal and Replica::truncate_and_replay. Verify that after a sequence of mutations on healthy replicas, calling heal() brings the partitioned ones back to byte-identical snapshots. Tests 5 and 13.

Stage 5 — Canonical snapshot

Decide the wire format (see CONCEPTS.md), implement dump_state, write the byte-format test that pins every field offset (test 8). The test fails loudly if a future refactor changes endianness or field order.

Stage 6 — Workload driver

Port splitmix64 (mix and stateful generator). Decode each r1 high-bit pair into Put/Del. Encode r3 % 10000 as a fixed 8-byte LE value so the byte width is independent of host word size. Tests 9, 10.

Stage 7 — Cross-language exam

Build the Rust binary, capture the actual hash for scenarios A and B, bake those hashes into src/go/dkv20_test.go, src/cpp/tests/test_dkv20.cc, and scripts/cross_test.sh. Port Go. Port C++. Run bash scripts/cross_test.sh and watch all three values align.

Stage 8 — Verification + docs

scripts/verify.sh runs all three test suites. scripts/cross_test.sh runs all three binaries on both scenarios. Doc set (analysis, execution, observation, verification, broader-ideas) plus three steps/ study files.

Pitfalls to expect

Symptom	Likely cause
Go scenario hash doesn't match Rust	unsorted `map` iteration in `DumpState`
C++ scenario hash doesn't match Rust	endian / size mismatch in `put_u32_le` / `put_u64_le`
C++ tests pass in Debug, fail in Release	`assert(side_effect)` — Release strips it
Wrong commit_index after partition heal	snapshot push not clearing `state_machine`
Build error: duplicate `package` declaration	leftover stub file that was never removed
Half-built ports left behind by an interrupted session	resume manually; the hash tells you whether the port works

db-20 — Observation

Frozen exam hashes

Scenario	Args	sha256
A	`--seed 42 --ops 500 --keys 32 --scenario default`	`1febc1252f87f873c315526e9d9c78a622131d700dccca84a6e089244930252b`
B	`--seed 7 --ops 2000 --keys 128 --scenario partition`	`272af5b41b729896a7195a6ea72d19111a96a50b29d5d4cdfaac03a058e1a2dc`

These three statements are all asserted by scripts/cross_test.sh:

Rust, Go, and C++ each produce the hash above for the given scenario.
All five replicas in the cluster produce the identical snapshot after the scenario completes (TestScenarioBReplicasConverge in Go, test_scenario_b_frozen in C++, scenario_b_partitioned_replicas_converge_after_heal in Rust).
The cluster has no live partitions when the driver returns.

Quantitative observations

metric	scenario A	scenario B
ops	500	2000
`keys` parameter	32	128
committed-on-leader entries	500	2000
approximate Put / Del fraction	3/4 vs 1/4	3/4 vs 1/4
live keys in final state (approx)	< 32	< 128

Every committed entry executes exactly once on the leader; partitioned followers see all of them after heal() because truncate_and_replay replays the entire log.

Behavioural observations

Convergence is deterministic. No timeouts, no clocks. Running the workload twice with the same seed always yields the same bytes (workload_determinism in Rust, TestWorkloadDeterminism in Go, test_workload_deterministic in C++).
Sub-quorum proposals leave uncommitted tails. The leader's last_idx advances every propose, but commit_index only advances on quorum acks. This is observable in the test sub_quorum_does_not_commit — the leader sees last_idx == 1 and commit_index == 0.
Heal is a one-shot. In scenario B, after the heal call at ops*3/4, all five replicas have byte-identical state machines. There is no period of eventual consistency — convergence is instantaneous and deterministic by construction.
Delete is real. Del removes the key from the state machine. A later Put reusing the same key is a fresh entry, not a "revive". This is asserted by test_del_removes_key (C++) and friends.

Performance notes (this lab is not a perf study)

The reference implementations are single-threaded, in-memory, with no I/O. Scenario B runs in ~5 ms in Release Rust on an M-series Mac; the snapshot push during heal() is O(log_size) per partitioned follower, which is the dominant cost.

The lab deliberately optimises for clarity and byte-identity, not throughput. Real systems (db-09 leveldb-complete is a good adjacent reference) batch and pipeline replication; here every propose is synchronous.

db-20 — Verification

Three layers of test

1. Per-language unit tests

File	Tests	Covers
`src/rust/src/lib.rs` `mod tests`	14	splitmix64, sha256, single replica, quorum, sub-quorum, partition, heal, convergence, del, byte format, determinism, scenarios A/B, snapshot push, NoOp
`src/go/dkv20_test.go`	15	same set + an extra stdlib sha256 cross-check
`src/cpp/tests/test_dkv20.cc`	13	same set

Run with:

( cd src/rust && cargo test --release )
( cd src/go && go test ./... )
( cd src/cpp/build && cmake --build . && ctest --output-on-failure )

scripts/verify.sh is the one-shot wrapper for all three and ends with === OK ===.

2. Cross-language byte-identity exam

scripts/cross_test.sh builds three clusterctl binaries (Rust, Go, C++) and runs:

clusterctl workload --seed 42 --ops 500  --keys 32  --scenario default
clusterctl workload --seed 7  --ops 2000 --keys 128 --scenario partition

The script asserts: rust_hash == go_hash == cpp_hash == golden_hash for each scenario. Ends with === ALL OK ===. Failure on any line exits non-zero and prints the diverging hashes.

3. Frozen golden hashes baked into source

The golden values are duplicated in three places on purpose:

scripts/cross_test.sh
src/go/dkv20_test.go (hashA, hashB constants)
src/cpp/tests/test_dkv20.cc (string-literal in test_scenario_a_frozen and test_scenario_b_frozen)

A change to the wire format or workload spec must update all three to keep verify + cross_test green. The redundancy makes silent drift impossible.

Sanity-check invariants (asserted by the tests)

Sha256Hex(empty) and Sha256Hex("abc") match the canonical SHA-256 test vectors.
Splitmix64Mix(0) == 0x8b57dafca0cee644 in all three languages.
DumpState for one Put produces exactly 38 bytes whose layout is pinned byte-by-byte.
NoOp advances commit_index but leaves the state machine empty.

What this exam does NOT verify

Real persistence (no log file, no fsync).
Real elections (leader is fixed; current_term == 1).
Real RPC failure injection (we model partitions only).
Linearizable read paths (reads are direct map lookups).

Those are deliberate scope cuts — see analysis.md and broader-ideas.md.

db-20 — Broader Ideas

The lab is deliberately small. Here is the menu of extensions that keep the same cross-language exam structure intact.

Elections

Add a term bump path. Replace fixed leader_idx = 0 with a leader elected by randomised timeout (or a deterministic priority list, if you want to keep cross-language byte identity). The snapshot already serialises current_term, so the wire format does not need to change. New invariant to test: after a leader change, every healthy replica still converges to the same snapshot.

Linearizable reads

Currently reads are direct state_machine.get(k). To make them linearizable, gate every read through a "read index" — leader confirms it is still leader by exchanging heartbeats with a quorum, then returns the value at commit_index. The byte-identity exam stays the same; you add a TestReadIndexBlocksUntilQuorum-style scenario.

Log compaction / snapshot install

Today heal() ships the entire log. For long-running clusters that is unbounded. Add Replica::compact(up_to_idx) that drops the prefix and records a CompactedSnapshot at the head; change try_append to also accept "follower has snapshot_idx == prev_idx - delta". The exam scenarios still pass because the applied state is unchanged.

Multi-key transactions

Replace Op::Put with Op::Txn(Vec<KeyOp>) and apply atomically. This is a small, well-scoped extension that nudges the lab toward db-13 (transactions and MVCC) territory.

Membership changes

Add a JointConsensus op (Raft §6) that switches the cluster's quorum during a configuration change. Trickier — the snapshot needs to include the active config — but a worthy follow-on if you want to see why "just add a node" is a real problem.

Disk persistence

Persist the log to a file (use db-01's pwrite-and-fsync pattern). Test crash recovery by tearing down a replica and reconstructing it from the log file. The snapshot bytes do not change.

Learner replicas

Add a replica role that receives entries but does not count toward quorum. Useful for read scaling. The snapshot bytes do not change.

Gossip-style membership

Replace the static replica list with a SWIM-style gossip layer that discovers and evicts replicas. Far more invasive: at this point you are building something closer to Consul (Raft plus SWIM-style gossip) than to this lab.

Bridges to other labs in the repo

Extension	Builds on which other lab?
Disk persistence	db-01 (storage primitives), db-03 (WAL)
Linearizable reads	db-16 (distributed fundamentals)
Multi-key transactions	db-13 (transactions and MVCC)
Compaction / snapshot	db-09 (leveldb), db-21 (storage advanced)
Real elections + RPCs	db-17 (raft)
Multi-region / quorum mix	db-22 (perf & benchmarking)

Step 01 — Cluster and Replica

Goal: in ~30 minutes, build a single-replica Cluster::new(1) that accepts Put / Del and returns a state machine you can inspect.

What to read first

CONCEPTS.md § "Data model" and § "Propose / commit cycle".
db-17-raft/CONCEPTS.md for the words log entry, commit index, quorum if they are not yet second nature.

Concrete tasks

Define OpKind, Op, LogEntry, Replica, Cluster in the language of your choice. Match the field layout from src/rust/src/lib.rs.
Implement Replica::last_idx, Replica::try_append, Replica::advance_commit_to. Note that apply(state_machine, op) is the only place where state_machine mutates outside of truncate_and_replay.
Implement Cluster::new(n), quorum, propose. For now, treat every reachable follower as a successful append (no NACK path yet).
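
The three Replica methods from the tasks above can be sketched as follows. This is a minimal sketch, not the lab's reference code: the field names, the 1-based idx convention, and the BTreeMap state machine are assumptions, and the real layout is whatever src/rust/src/lib.rs defines.

```rust
use std::collections::BTreeMap;

/// Hypothetical layout; the lab's real one lives in src/rust/src/lib.rs.
#[derive(Clone, Debug, PartialEq)]
pub enum Op {
    Put(String, Vec<u8>),
    Del(String),
    NoOp,
}

#[derive(Clone, Debug, PartialEq)]
pub struct LogEntry {
    pub idx: u64, // assumed 1-based and contiguous
    pub term: u64,
    pub op: Op,
}

#[derive(Default)]
pub struct Replica {
    pub log: Vec<LogEntry>,
    pub commit_index: u64,
    pub state_machine: BTreeMap<String, Vec<u8>>,
}

/// The single place where the state machine is mutated.
pub fn apply(sm: &mut BTreeMap<String, Vec<u8>>, op: &Op) {
    match op {
        Op::Put(k, v) => {
            sm.insert(k.clone(), v.clone());
        }
        Op::Del(k) => {
            sm.remove(k);
        }
        Op::NoOp => {}
    }
}

impl Replica {
    /// Index of the last log entry, or 0 for an empty log.
    pub fn last_idx(&self) -> u64 {
        self.log.last().map_or(0, |e| e.idx)
    }

    /// Accept the entry only if it directly extends the log; a gap returns false.
    pub fn try_append(&mut self, entry: LogEntry) -> bool {
        if entry.idx != self.last_idx() + 1 {
            return false;
        }
        self.log.push(entry);
        true
    }

    /// Apply every entry in (commit_index, target] exactly once, in log order.
    pub fn advance_commit_to(&mut self, target: u64) {
        let target = target.min(self.last_idx());
        while self.commit_index < target {
            let entry = self.log[self.commit_index as usize].clone();
            apply(&mut self.state_machine, &entry.op);
            self.commit_index += 1;
        }
    }
}
```

Keeping apply as a free function makes the "applied before committed" bug easy to grep for: any call site outside advance_commit_to is suspect.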

Definition of done

#![allow(unused)]
fn main() {
let mut c = Cluster::new(1);
assert!(c.propose(Op::Put("a".into(), vec![1,2,3])));
assert_eq!(c.leader().state_machine.get("a"), Some(&vec![1,2,3]));
}

The snippet above passes, and its equivalents pass in Go and C++. Run cargo test single_replica_put_commits to confirm.

Common bugs at this stage

Forgetting to bump next_log_idx so two proposals get the same idx.
Applying op before the entry is committed.
Iterating an unsorted map somewhere (Go) — even at this stage, start the habit of sort.Strings(keys) before any deterministic output.

Step 02 — Quorum Replication

Goal: extend the cluster to 5 replicas and make the leader commit only when a majority acks.

What to read first

docs/execution.md Stages 2 and 3.
The propose loop in src/rust/src/lib.rs (the part that calls try_append and counts acks).

Concrete tasks

Implement Cluster::partition(ids) and is_partitioned. Store partitioned ids in a sorted set so iteration order is stable.
In propose, count acks only from non-partitioned, non-leader replicas (plus one for the leader if the leader itself is not partitioned). If acks < quorum, return false and leave the entry uncommitted.
Write the three quorum tests:
- 5/5 reachable → commit on all.
- 3/5 reachable → commit on the reachable three.
- 2/5 reachable → no commit anywhere.
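
The ack-counting rule in the tasks above is small enough to isolate. The sketch below assumes a hypothetical appended callback standing in for the per-follower try_append call; count_acks and its signature are illustrative, not the lab's API.

```rust
use std::collections::BTreeSet;

/// Majority size for an n-replica cluster.
pub fn quorum(n: usize) -> usize {
    n / 2 + 1
}

/// Count acks for one proposal. `appended(i)` is a hypothetical callback that
/// reports whether follower `i` accepted the entry.
pub fn count_acks(
    n: usize,
    leader_idx: usize,
    partitioned: &BTreeSet<usize>,
    mut appended: impl FnMut(usize) -> bool,
) -> usize {
    // The leader acks its own entry exactly once, and only if it is reachable.
    let mut acks = if partitioned.contains(&leader_idx) { 0 } else { 1 };
    for i in 0..n {
        // Skip the leader (already counted) and every partitioned follower.
        if i == leader_idx || partitioned.contains(&i) {
            continue;
        }
        if appended(i) {
            acks += 1;
        }
    }
    acks
}
```

With five replicas and ids 2, 3, 4 partitioned, count_acks returns 2, which is below quorum(5) == 3, so the proposal stays uncommitted.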

Definition of done

#![allow(unused)]
fn main() {
let mut c = Cluster::new(5);
c.partition(&[2, 3, 4]);
assert!(!c.propose(Op::Put("k".into(), b"v".to_vec())));
assert_eq!(c.leader().last_idx(), 1);
assert_eq!(c.leader().commit_index, 0);
}

The snippet above must pass. The Go and C++ ports must match.

Common bugs at this stage

Counting partitioned followers' acks anyway (a follower in the partitions set must contribute zero acks).
Counting the leader twice (once for i == leader_idx, once for acks = 1).
Advancing commit_index on the leader but not on the followers.
Mutating state_machine before the commit check passes.

Step 03 — Partitions and Catchup

Goal: implement heal() and truncate_and_replay so a partitioned follower can rejoin and converge. Then ship the cross-language exam.

What to read first

CONCEPTS.md § "Partition + heal".
docs/execution.md Stages 4–7.
src/rust/src/lib.rs — the truncate_and_replay and Cluster::heal bodies.

Concrete tasks

Implement Replica::truncate_and_replay(leader_log, leader_commit): replace own log, wipe state machine, replay committed prefix.
Implement Cluster::heal():
- clone leader log + commit index up front (avoid use-after-mutate),
- clear partitions,
- for each previously-partitioned follower in ascending id order, call truncate_and_replay.
In propose, when try_append returns false (gap), do a snapshot push immediately and count the ack.
Implement dump_state per the wire format in CONCEPTS.md. Pin every byte offset in a test (test_snapshot_byte_format).
Port the workload driver (run_workload) to all three languages. The byte-decoding rules — kind = (r1 >> 62) & 0x3, k = "k" + …, v = u64_le(r3 % 10000) — must be identical across all three.
Build Rust binary, run scenarios A and B, capture the hashes, bake them into Go test, C++ test, and scripts/cross_test.sh.
Bring Go and C++ green: run scripts/cross_test.sh. It must end with === ALL OK ===.
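
The first two tasks (truncate_and_replay and heal) can be sketched together. The types here are deliberately simplified assumptions: the log holds bare ops with entry i at index i + 1, and the Cluster fields are hypothetical names rather than the lab's layout.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Clone)]
pub enum Op {
    Put(String, Vec<u8>),
    Del(String),
}

#[derive(Default)]
pub struct Replica {
    pub log: Vec<Op>, // entry i has log index i + 1 (assumption)
    pub commit_index: usize,
    pub state_machine: BTreeMap<String, Vec<u8>>,
}

impl Replica {
    /// Replace the log wholesale, wipe the state machine, replay the committed prefix.
    pub fn truncate_and_replay(&mut self, leader_log: &[Op], leader_commit: usize) {
        self.log = leader_log.to_vec();
        self.state_machine.clear();
        self.commit_index = 0;
        for op in &leader_log[..leader_commit.min(leader_log.len())] {
            match op {
                Op::Put(k, v) => {
                    self.state_machine.insert(k.clone(), v.clone());
                }
                Op::Del(k) => {
                    self.state_machine.remove(k);
                }
            }
            self.commit_index += 1;
        }
    }
}

pub struct Cluster {
    pub replicas: Vec<Replica>,
    pub leader_idx: usize,
    pub partitioned: BTreeSet<usize>,
}

impl Cluster {
    pub fn heal(&mut self) {
        // Snapshot the leader's view before touching any follower.
        let leader_log = self.replicas[self.leader_idx].log.clone();
        let leader_commit = self.replicas[self.leader_idx].commit_index;
        // Taking the set clears it; a BTreeSet iterates in ascending id order.
        let healed = std::mem::take(&mut self.partitioned);
        for id in healed {
            if id != self.leader_idx {
                self.replicas[id].truncate_and_replay(&leader_log, leader_commit);
            }
        }
    }
}
```

Cloning the leader log up front costs one allocation and sidesteps the borrow that would otherwise overlap with the mutable access to each follower.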

Definition of done

bash scripts/verify.sh        # → "=== OK ==="
bash scripts/cross_test.sh    # → "=== ALL OK ==="

The two scenarios must produce these hashes:

A: 1febc1252f87f873c315526e9d9c78a622131d700dccca84a6e089244930252b
B: 272af5b41b729896a7195a6ea72d19111a96a50b29d5d4cdfaac03a058e1a2dc

Common bugs at this stage

heal() reads leader.log after mutating a follower — use a snapshot variable.
dump_state in Go iterates the map directly → randomised hash. Fix: sort the keys.
dump_state in C++ uses strcpy(magic_buf, "DSEDKV20") and copies 9 bytes including the NUL. Fix: std::memcpy(buf, MAGIC.data(), 8).
C++ test passes in Debug, fails in Release because assert(c.propose(...)) got stripped. Fix: assign to a bool ok = ... first, then assert(ok).
CLI prints a trailing newline. The exam compares full lines; a trailing \n breaks the hash comparison.

Advanced Storage Engine

Lab status: complete. All unit tests pass; scripts/cross_test.sh proves three independent implementations (Rust, Go, C++) produce byte-identical canonical wire dumps for three fixed workloads.

1. What Is It

A standalone study of two pieces that turn a textbook LSM tree into something closer to RocksDB / LevelDB strength:

Range tombstones: a single record that logically deletes every key in a half-open interval [start, end), instead of writing one Delete per key.
Compaction policies: size-tiered (Cassandra/Scylla heritage) and Universal (RocksDB's write-optimised alternative to its default leveled style), expressed as deterministic, side-effect-free functions over the sequence of SSTs.

Everything else (block cache, bloom layout, manifest, WAL, MVCC) is held at its simplest possible form so the two ideas above are studied in isolation.

2. Why Care

Range tombstones make DELETE FROM t WHERE id BETWEEN ? AND ? and TTL-style "drop everything older than X" affordable. Without them, a one-million-key range delete writes one million Delete entries and, worse, every subsequent read over that range has to step past those tombstones until compaction carries them to the bottom level.
Compaction policies are the single biggest knob in any LSM. Size-tiered minimises write amplification at the cost of read amplification; Universal bounds the SST count while preserving recency. Choosing one is choosing the workload shape you'll be good at.

3. Core Data Structures

Type	Purpose
`Entry { Put(k,v) \| Delete(k) }`	The point-write unit.
`RangeTomb { start, end }`	Half-open interval delete; `start ≤ k < end`.
`SimpleBloom (u64)`	64-bit single-word bloom; two FNV-1a positions.
`Sst { smallest, largest, entries, range_tombstones, bloom }`	An immutable run.
`LsmTreeAdvanced { ssts (newest first), ratio }`	The whole tree.

Sst::size() = entries.len() + 2 * range_tombstones.len(). The ×2 weight on tombstones is deliberate: it makes compaction more eager when tombstones pile up, which matches real-world tuning advice.
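
As a rough illustration of the table and the size rule, the types might look like this in Rust. The field names follow the table; the tuple-variant shape of Entry and the covers helper name are assumptions.

```rust
/// Point write. The tuple-variant shape is illustrative.
#[derive(Clone, Debug, PartialEq)]
pub enum Entry {
    Put(Vec<u8>, Vec<u8>),
    Delete(Vec<u8>),
}

/// Half-open interval delete: covers `start <= k < end`.
#[derive(Clone, Debug, PartialEq)]
pub struct RangeTomb {
    pub start: Vec<u8>,
    pub end: Vec<u8>,
}

impl RangeTomb {
    /// Byte-wise lexicographic comparison; `end` itself is not covered.
    pub fn covers(&self, key: &[u8]) -> bool {
        self.start.as_slice() <= key && key < self.end.as_slice()
    }
}

pub struct Sst {
    pub smallest: Vec<u8>,
    pub largest: Vec<u8>,
    pub entries: Vec<Entry>,
    pub range_tombstones: Vec<RangeTomb>,
    pub bloom: u64,
}

impl Sst {
    /// Tombstones count double, so tombstone-heavy SSTs look bigger to compaction.
    pub fn size(&self) -> usize {
        self.entries.len() + 2 * self.range_tombstones.len()
    }
}
```

An SST with two entries and one range tombstone therefore has size 4, not 3.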

4. The Six Algorithms in One Page

Build SST. Walk pending entries right-to-left, mark each key's last occurrence as keep. Then walk left-to-right emitting kept entries (preserves insertion order of survivors). Compute smallest/largest and the bloom in the same left-to-right pass. Range tombstones are copied verbatim.
Get(key). Walk SSTs newest → oldest, accumulating active tombstones into a vector as we go. For each SST:
- append its tombs to active,
- if any tomb in active covers key, return None (early exit),
- if bloom misses, continue (bounds and entries skipped, but older SSTs may still contain covering tombs — so the walk continues),
- if key < smallest or key > largest, continue for the same reason,
- linear scan entries; first hit returns Some(v) for Put, None for Delete.
Size-tiered compaction. Pick the longest prefix L ∈ [2, n-1] such that Σ size(ssts[0..L]) ≤ ratio · size(ssts[L]). Merge that prefix into one SST and insert it at the newest position (index 0). If no such L exists, return false.
Universal compaction. Pick the longest contiguous run [i, i+L) with L ≥ 3 such that Σ size(run) ≤ ratio · size(ssts[i+L]) (i.e. the run is followed by something at least 1/ratio times as big, which is twice as big at the fixtures' ratio of 0.5). Ties broken by smaller i. Merge the run, replace it in place.
Merge. Walk the run newest → oldest. For each SST:
- copy its range tombs into out_tombs (preserved verbatim),
- for each entry: skip if seen[key] (newer-wins), skip if covered by any previously active tomb, otherwise keep it; mark seen,
- append the SST's range tombs to active.
Finally sort out_entries by key for determinism and recompute the bloom + bounds.
Dump (canonical wire format). A length-prefixed binary blob, little-endian throughout. Magic "DSEADV21" ‖ f64 ratio ‖ u32 sst_count ‖ per SST: lenpref(smallest) ‖ lenpref(largest) ‖ u32 nE ‖ entries (u8 kind ‖ lenpref(key) ‖ if Put: lenpref(val)) ‖ u32 nT ‖ tombs (lenpref(start) ‖ lenpref(end)) ‖ u64 bloom.
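
The dedup walk in "Build SST" is the least obvious of the six, so here is a minimal sketch of just that part. The function name dedup_keep_last is hypothetical, and bounds, bloom, and tombstone copying are left out.

```rust
use std::collections::HashSet;

#[derive(Clone, Debug, PartialEq)]
pub enum Entry {
    Put(Vec<u8>, Vec<u8>),
    Delete(Vec<u8>),
}

impl Entry {
    pub fn key(&self) -> &[u8] {
        match self {
            Entry::Put(k, _) | Entry::Delete(k) => k.as_slice(),
        }
    }
}

/// Keep only each key's last write, preserving the survivors' insertion order.
pub fn dedup_keep_last(pending: &[Entry]) -> Vec<Entry> {
    let mut seen: HashSet<&[u8]> = HashSet::new();
    let mut keep = vec![false; pending.len()];
    // Right-to-left: the first time a key is met is its last occurrence.
    for (i, e) in pending.iter().enumerate().rev() {
        if seen.insert(e.key()) {
            keep[i] = true;
        }
    }
    // Left-to-right: emit survivors in their original order.
    pending
        .iter()
        .zip(keep)
        .filter(|(_, k)| *k)
        .map(|(e, _)| e.clone())
        .collect()
}
```

The HashSet is only ever probed, never iterated, so its ordering cannot leak into the output.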

5. What's Deliberately Not Here

No WAL — recovery is out of scope; the tree is in-memory.
No block cache, no separate index/filter blocks — the bloom is one u64.
No level structure — SSTs are a flat list, newest first.
No snapshots / MVCC — Get is point-in-time only.
No concurrency — everything is single-threaded; SSTs are immutable so reads-with-merges would be trivially safe.

These omissions are why the lab fits in three files per language while still exercising the two ideas (range tombstones, compaction policy) at the depth where their subtleties bite.

6. Pointers to Cross-Language Equivalence

The whole point of the lab is that three independent implementations agree byte-for-byte, not just at API level. The shared deterministic workload (SplitMix64 PRNG, three draws per op, fixed flush/compact cadence) and the canonical wire format (Section 4.6) are the two halves of that contract. scripts/cross_test.sh enforces it with three hard-coded sha256 fixtures.

See docs/execution.md for the format spec, docs/verification.md for the expected output of the verification scripts, and docs/analysis.md for the design forces behind both range tombstones and the two compaction policies.

References — db-21

The lab is intentionally self-contained, but the ideas are not original.

Range Tombstones

RocksDB blog, "DeleteRange: A New Native RocksDB Operation" https://rocksdb.org/blog/2018/11/21/delete-range.html
CockroachDB blog: "DeleteRange and the importance of tombstones in a distributed SQL database."
"Bringing Modern Hierarchical Memory Systems Into Main-Memory Databases" (Bortnikov et al.) — discusses interval-deletion structures.

Compaction Policies

"The Log-Structured Merge-Tree (LSM-Tree)", O'Neil et al., 1996. The canonical paper; introduces the size-tiered idea.
RocksDB wiki, "Universal Compaction Style" https://github.com/facebook/rocksdb/wiki/Universal-Compaction
RocksDB wiki, "Leveled Compaction" https://github.com/facebook/rocksdb/wiki/Leveled-Compaction
ScyllaDB docs, "Size-tiered compaction strategy (STCS)" — the Cassandra heritage of size-tiered.

SplitMix64

Steele, Lea, Flood, "Fast Splittable Pseudorandom Number Generators", OOPSLA 2014. The mixing constants 0x9E37..., 0xBF58..., 0x94D0... come straight from this paper.

FNV-1a

Glenn Fowler, Landon Curt Noll, Phong Vo, "FNV non-cryptographic hash." The 64-bit offset 0xCBF29CE484222325 and prime 0x100000001B3 are the standard FNV-1a parameters.

Cross-Language Byte Equivalence as a Methodology

TigerBeetle's "Tigerstyle" notes on deterministic simulation.
FoundationDB's flow-based testing — the closest commercial analogue to "spec-by-hash-of-canonical-dump".

Analysis — db-21 Advanced Storage Engine

1. Problem Statement

Two engineering questions, studied in isolation:

Range deletes. How does an LSM delete a key range [a, b) cheaply, without writing one Delete per key, and without losing correctness if a newer Put falls inside the same range?
Compaction policy. How do size-tiered and Universal compaction actually differ — not as marketing words, but as deterministic functions over the current SST sequence?

The lab refuses to answer these with prose alone. It demands an executable specification that three language ports must agree on byte-for-byte.

2. Why Three Languages

Cross-language byte agreement is the cheapest sanity check that survives refactoring. If Rust drifts from Go on fixture A, the third port breaks the tie and points at the side that broke, and the diff between the two dump_state() blobs is a structured binary, decodable by eye.

It also forces the design through three different idiom sets:

Rust keeps Option<Vec<u8>> for Get, enum Entry { Put, Delete } for the entry kind, and uses Vec<u8> everywhere for keys/values. Slices for bounds; no copies in merge_run's hot path.
Go uses []byte plus bytes.Compare. A map[string]struct{} stands in for the dedupe set. math.Float64bits for the ratio encoding.
C++ uses std::string as a byte container (avoids the char_traits trap), std::optional<std::string> for Get, and an inline 64-line SHA-256 in lsmctl.cc to keep the dependency surface at zero.

If you can read the same algorithm in all three and they line up at the byte level, the algorithm is unambiguous. That's the deliverable.

3. Range Tombstones — The Subtlety

The single non-obvious rule is:

A range tombstone hides keys older than it, but is itself hidden by Puts newer than it.

Both halves matter. Test range_tomb_respects_newer_put exists because a naive implementation that consults all tombstones before walking entries will silently drop the fresh value.

The implementation enforces this by walking SSTs newest → oldest and accumulating active tombstones as the walk descends. A Put in a newer SST is checked against the (then still empty) active set, so it survives. A Put in an older SST is checked against the (by then populated) active set, so it is hidden.

This also explains why a bloom miss must continue instead of return None: the SST we just skipped might have zero matching keys, but it could still contribute a range tombstone that shadows something below it. The active set must be allowed to grow.
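
The walk described above can be sketched as a free function over a newest-first slice. Everything beyond the described control flow is an assumption: the way the two bloom positions are derived from FNV-1a (bloom_bits), the Sst::new constructor, and the tuple-variant Entry are all illustrative.

```rust
pub enum Entry {
    Put(Vec<u8>, Vec<u8>),
    Delete(Vec<u8>),
}

pub struct RangeTomb {
    pub start: Vec<u8>,
    pub end: Vec<u8>,
}

pub struct Sst {
    pub smallest: Vec<u8>,
    pub largest: Vec<u8>,
    pub entries: Vec<Entry>,
    pub range_tombstones: Vec<RangeTomb>,
    pub bloom: u64,
}

fn key_of(e: &Entry) -> &[u8] {
    match e {
        Entry::Put(k, _) | Entry::Delete(k) => k.as_slice(),
    }
}

/// Hypothetical bloom bits: two positions carved out of one FNV-1a hash.
pub fn bloom_bits(key: &[u8]) -> u64 {
    let mut h: u64 = 0xCBF29CE484222325;
    for &b in key {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001B3);
    }
    (1u64 << (h & 63)) | (1u64 << ((h >> 32) & 63))
}

impl Sst {
    /// Illustrative constructor: derives bounds and bloom from the entries.
    pub fn new(entries: Vec<Entry>, range_tombstones: Vec<RangeTomb>) -> Sst {
        let smallest = entries.iter().map(key_of).min().map(|k| k.to_vec()).unwrap_or_default();
        let largest = entries.iter().map(key_of).max().map(|k| k.to_vec()).unwrap_or_default();
        let bloom = entries.iter().fold(0u64, |acc, e| acc | bloom_bits(key_of(e)));
        Sst { smallest, largest, entries, range_tombstones, bloom }
    }
}

/// `ssts` is ordered newest first.
pub fn get(ssts: &[Sst], key: &[u8]) -> Option<Vec<u8>> {
    let mut active: Vec<&RangeTomb> = Vec::new();
    for sst in ssts {
        active.extend(sst.range_tombstones.iter());
        if active.iter().any(|t| t.start.as_slice() <= key && key < t.end.as_slice()) {
            return None; // shadowed by a tombstone at least as new as this SST
        }
        let bits = bloom_bits(key);
        if sst.bloom & bits != bits {
            continue; // bloom miss: skip the entries but keep walking
        }
        if key < sst.smallest.as_slice() || key > sst.largest.as_slice() {
            continue; // out of bounds: same reasoning
        }
        for e in &sst.entries {
            match e {
                Entry::Put(k, v) if k.as_slice() == key => return Some(v.clone()),
                Entry::Delete(k) if k.as_slice() == key => return None,
                _ => {}
            }
        }
    }
    None
}
```

Note the order inside the loop: the SST's tombstones join the active set before the bloom and bounds checks, which is what lets a tombstone-only SST shadow older data even though its bloom is empty.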

4. Size-Tiered vs Universal — The Real Difference

Both are "merge several SSTs into one". The difference is which several.

Size-tiered asks: "is there a prefix [0..L) of new, small SSTs that together fit within ratio · size(ssts[L])?" It picks the longest qualifying prefix, merges them, and inserts the result at the newest position. This is greedy from the top of the tree.
Universal asks the same shape of question, but over a contiguous run anywhere in the list, with a minimum run length of 3. It picks the longest run; ties go to the leftmost. The merged run replaces itself in place.

The minimum lengths (≥ 2 prefix for tiered, ≥ 3 run for universal) are deliberate, both to keep work amortised and to make the two policies distinguishable on small inputs. Without them, both would degenerate to "merge whenever you can" and the fixtures wouldn't separate them.

5. Why the Wire Format Looks Like That

Five choices, each with a reason:

Magic "DSEADV21" (eight bytes, no length prefix). A mismatch shows up in the first 16 hex chars of a hex dump of the blob (the sha256 itself changes wholesale), which is easy to spot.
f64 ratio — encoded via the IEEE 754 bit pattern, not as a string. This is why all three languages route through f64::to_bits, math.Float64bits, and memcpy(&u64, &double, 8). Strings would force a formatter choice ("0.5" vs "0.5000000000000000").
Length-prefixed keys/values — u32 LE lengths, raw bytes. No terminator, no escaping. Decoding is a one-pass scan.
Entries newest-SST-first — matches the in-memory layout; reversing it in the dump would obscure the actual data structure.
Bloom as raw u64 LE — not a list of positions. The bitmap is the bloom; nothing else needs to be portable.

6. Trade-offs Not Taken

We did not implement snapshot reads. Every Get is "as of right now".
We did not deduplicate range tombstones across SSTs at merge time. A range that fully contains an older range still leaves both in the merged output. Real engines coalesce; we don't, because the canonical-bytes test would then depend on a chosen normalisation policy.
We did not gate compaction on actual work performed; size-tiered may pick a length-2 prefix even when the merge produces zero entries (after tombstones erase everything). That's a feature for the study lab — it exercises the merge code; in production you'd skip empty merges.

Execution — db-21 Wire Format and Workload

This document is the single source of truth for the canonical wire format and the deterministic workload. Anything ambiguous here is a bug; fix the doc, not the implementations.

1. SplitMix64 PRNG

state += 0x9E3779B97F4A7C15
z      = state
z      = (z XOR (z >> 30)) * 0xBF58476D1CE4E5B9
z      = (z XOR (z >> 27)) * 0x94D049BB133111EB
return   z XOR (z >> 31)

All multiplications are unsigned 64-bit, wrapping on overflow. The PRNG is seeded with the user-supplied 64-bit seed. Three draws happen per op, even when only one or two are used — keep them in order (r1, r2, r3).
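
A direct Rust transcription of the pseudocode above, as a sketch. The struct and method names (SplitMix64, next_u64, draw3) are illustrative; the constants and shift amounts are the ones listed.

```rust
pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        SplitMix64 { state: seed }
    }

    /// One draw; every add and multiply wraps modulo 2^64.
    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }

    /// Always draw three values per op, in order, even if some go unused.
    pub fn draw3(&mut self) -> (u64, u64, u64) {
        let r1 = self.next_u64();
        let r2 = self.next_u64();
        let r3 = self.next_u64();
        (r1, r2, r3)
    }
}
```

Binding r1, r2, r3 in separate statements keeps the draw order explicit, which matters in C++, where argument evaluation order is unspecified.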

2. Operation Selection

op = (r1 >> 62) & 0b11

`op`	Action
0, 1	`Put(k = "k" + (r2 mod keys), v = u32_be(r3 as u32))`
2	`Delete(k = "k" + (r2 mod keys))`
3	`RangeTomb(start = "k" + a, end = "k" + (a + 1 + (r3 mod (keys-a))))` where `a = r2 mod keys`

In scenario ptonly, op 3 is rewritten to op 0 before the action runs. The three draws still happen.

The value bytes are the big-endian 32-bit representation of r3 truncated to 32 bits. (Big-endian because it produces visually distinct bytes across fixtures; the format is otherwise little-endian.)
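
The selection table can be sketched as one pure function of the three draws. Two things here are assumptions rather than spec: the key suffix is rendered as plain decimal, and the WorkloadOp type and decode_op name are hypothetical.

```rust
#[derive(Debug, PartialEq)]
pub enum WorkloadOp {
    Put { key: String, value: [u8; 4] },
    Delete { key: String },
    RangeTomb { start: String, end: String },
}

/// Decode one op from three draws. Assumes `keys > 0` and a plain decimal
/// key suffix (the suffix format is an assumption, not taken from the spec).
pub fn decode_op(r1: u64, r2: u64, r3: u64, keys: u64, ptonly: bool) -> WorkloadOp {
    let mut op = (r1 >> 62) & 0b11;
    if ptonly && op == 3 {
        op = 0; // ptonly rewrites range tombstones to Puts
    }
    let a = r2 % keys;
    match op {
        0 | 1 => WorkloadOp::Put {
            key: format!("k{}", a),
            // Truncate to 32 bits, then big-endian (unlike the rest of the format).
            value: (r3 as u32).to_be_bytes(),
        },
        2 => WorkloadOp::Delete { key: format!("k{}", a) },
        _ => WorkloadOp::RangeTomb {
            start: format!("k{}", a),
            // keys - a is at least 1 because a < keys, so the modulus is safe.
            end: format!("k{}", a + 1 + (r3 % (keys - a))),
        },
    }
}
```

Because a < keys, the end suffix always lands in [a + 1, keys], so the suffix of end is strictly greater than the suffix of start.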

3. Flush and Compact Cadence

Every 8 ops (i.e. when (op_idx + 1) % 8 == 0): flush all pending entries and tombstones into a new SST at the newest position.
Every 16 ops (i.e. when (op_idx + 1) % 16 == 0): run one compaction pass appropriate to the scenario (size-tiered, universal, or no-op).
No residue flush at end. If the loop ends with non-zero pending entries, they are discarded. This is intentional: it keeps the cross-language hashes stable regardless of ops mod 8.
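
The cadence is easy to get off by one, so a small sketch that only records when flushes and compactions fire may help. The Event type and both function names are hypothetical, and running the flush before the compaction on a shared boundary is an assumption.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Event {
    Flush,
    Compact,
}

/// Return the (op_idx, event) schedule for a run of `ops` operations.
/// Flushing before compacting on shared boundaries is an assumption here.
pub fn schedule(ops: u64) -> Vec<(u64, Event)> {
    let mut out = Vec::new();
    for op_idx in 0..ops {
        if (op_idx + 1) % 8 == 0 {
            out.push((op_idx, Event::Flush));
        }
        if (op_idx + 1) % 16 == 0 {
            out.push((op_idx, Event::Compact));
        }
    }
    out
}

/// Ops issued after the last flush boundary; their pending writes are discarded.
pub fn unflushed_ops(ops: u64) -> u64 {
    ops % 8
}
```

For fixture A (ops = 200) this gives 25 flushes and 12 compaction passes with nothing left pending; for fixture C (ops = 300) the last 4 ops never reach an SST.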

4. Canonical Wire Format

All integers little-endian. lenpref(b) means u32 LE len(b) ‖ b.

"DSEADV21"               (8 bytes, ASCII, no terminator)
f64 LE ratio             (IEEE 754 bit pattern, not a string)
u32 LE sst_count
for each SST (newest first):
    lenpref(smallest_key)
    lenpref(largest_key)
    u32 LE entry_count
    for each entry:
        u8 kind                 (Put = 1, Delete = 2)
        lenpref(key)
        if kind == Put: lenpref(value)
    u32 LE range_tomb_count
    for each range tombstone:
        lenpref(start_key)
        lenpref(end_key)
    u64 LE bloom_bitmap
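
A minimal encoder for the layout above, as a sketch. The free-function shape of dump_state, the tuple representation of range tombstones, and the helper names are assumptions; the byte layout follows the spec line by line.

```rust
pub enum Entry {
    Put(Vec<u8>, Vec<u8>),
    Delete(Vec<u8>),
}

pub struct Sst {
    pub smallest: Vec<u8>,
    pub largest: Vec<u8>,
    pub entries: Vec<Entry>,
    pub range_tombstones: Vec<(Vec<u8>, Vec<u8>)>, // (start, end)
    pub bloom: u64,
}

/// Counts and lengths are always u32 LE, never the host's usize.
fn put_u32(out: &mut Vec<u8>, n: usize) {
    out.extend_from_slice(&(n as u32).to_le_bytes());
}

fn lenpref(out: &mut Vec<u8>, b: &[u8]) {
    put_u32(out, b.len());
    out.extend_from_slice(b);
}

/// `ssts` is ordered newest first.
pub fn dump_state(ratio: f64, ssts: &[Sst]) -> Vec<u8> {
    let mut out: Vec<u8> = Vec::new();
    out.extend_from_slice(b"DSEADV21");
    // IEEE 754 bit pattern, little-endian; never a formatted string.
    out.extend_from_slice(&ratio.to_bits().to_le_bytes());
    put_u32(&mut out, ssts.len());
    for sst in ssts {
        lenpref(&mut out, &sst.smallest);
        lenpref(&mut out, &sst.largest);
        put_u32(&mut out, sst.entries.len());
        for e in &sst.entries {
            match e {
                Entry::Put(k, v) => {
                    out.push(1);
                    lenpref(&mut out, k);
                    lenpref(&mut out, v);
                }
                Entry::Delete(k) => {
                    out.push(2);
                    lenpref(&mut out, k);
                }
            }
        }
        put_u32(&mut out, sst.range_tombstones.len());
        for (start, end) in &sst.range_tombstones {
            lenpref(&mut out, start);
            lenpref(&mut out, end);
        }
        out.extend_from_slice(&sst.bloom.to_le_bytes());
    }
    out
}
```

An empty tree with ratio 0.5 dumps to exactly 20 bytes: the 8-byte magic, the 8 bytes 00 00 00 00 00 00 E0 3F, and a zero sst_count.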

5. The Three Canonical Fixtures

Captured from the Rust reference and pinned in scripts/cross_test.sh:

Fixture	seed	ops	keys	scenario	sha256 of dump
A	42	200	32	`tieredcompact`	`fc2fe88978eb2d419a73a7a16fa9ec0695ad9a56cb3a31b0bf85c0a28d7c97d6`
B	7	500	64	`universalcompact`	`05b07426e0da8ec2f1f8c81573dc275cd61cab9c19c93dc17c854456e441e7bb`
C	99	300	16	`withrange`	`4ad255755dbfbaa40a842766656d0c0dbd6713b6a527ffea5a24fa35964d73e4`

If you change anything about the workload or the wire format, these hashes change. That's the contract: the hashes are intentional padlocks on behavioural drift.

6. lsmctl CLI

lsmctl workload --seed S --ops N --keys K --scenario {ptonly|withrange|tieredcompact|universalcompact}

Prints the lowercase hex sha256 of dump_state() followed by a newline. Exit code is 0 on success, 2 on argument errors. All three ports must agree on stdout byte-for-byte for the same arguments.

7. Reproducing the Hashes

cd db-21-storage-engine-advanced
./scripts/verify.sh     # all unit tests
./scripts/cross_test.sh # cross-language byte equivalence

Expected last line: === ALL OK ===.

Observation — db-21

1. The Three Hashes

A  seed=42  ops=200  keys=32  tieredcompact     fc2fe88978eb2d419a73a7a16fa9ec0695ad9a56cb3a31b0bf85c0a28d7c97d6
B  seed=7   ops=500  keys=64  universalcompact  05b07426e0da8ec2f1f8c81573dc275cd61cab9c19c93dc17c854456e441e7bb
C  seed=99  ops=300  keys=16  withrange         4ad255755dbfbaa40a842766656d0c0dbd6713b6a527ffea5a24fa35964d73e4

All three languages produce all three hashes on the first run after each clean build. This was not a happy accident — it required keeping every sneaky source of nondeterminism out of the merge step:

HashSet iteration order doesn't leak (we sort out_entries by key after the merge, and we never serialise the seen set).
Map ordering doesn't leak (Go uses a map[string]struct{} for dedupe but never iterates it; entries come out of a slice).
Floating-point comparison doesn't leak (the ratio is 0.5 exactly, which is a representable f64; Σ size ≤ ratio · size is integer-vs-rational with no rounding ambiguity at this scale).

2. What Bit Us During Development

Two-pass size-tiered. An early draft computed prefix_sum once to pick chosen, then recomputed it inside the merge call. The two passes drifted under refactoring. Fixed by collapsing to a single pass that updates prefix_sum inline.
Go math.Float64bits. Initial Go draft tried to avoid the math import by writing a wrapper chain (float64bits → float64bitsFallback → math_Float64bits). The chain was broken (no math import to define the leaf). Lesson: don't fight the standard library for ceremony.
C++ std::optional<std::string> for Get. Worth the friction versus a sentinel value: a Put of the empty string is distinguishable from absent, which is testable in dedup_keeps_last.

3. What We Didn't Observe (and why that's good)

No platform endianness surprises. macOS arm64 produced the same hashes the canonical fixtures pin. The explicit LE encoding in every put-int helper means we'd survive a big-endian port too.
No f64 rounding drift. The ratio is 0.5 and the sizes are small integers; nothing forces denormals or transcendental math.
No SHA-256 mismatch. The Rust port uses an inline impl in lsmctl.rs; the Go port uses crypto/sha256; the C++ port uses the 64-line public-domain reference at the bottom of adv.cc. Three independent SHA-256 implementations agreeing on three hashes is the cheapest possible end-to-end test.

4. Resource Profile

Each cargo build --release takes ~5s cold. go build ~1s. cmake --build ~3s. cross_test.sh from cold runs in ~10s including all three builds. No external network, no Docker, no system packages beyond a working C++20 toolchain, Go ≥ 1.22, and Rust stable.

Verification — db-21

1. What "Verified" Means Here

Two distinct claims:

Per-language correctness: unit tests in each language pass.
Cross-language byte equivalence: three independent implementations produce identical canonical wire dumps for three fixed workloads, proven by sha256.

Both must hold. (1) without (2) lets each port drift independently into a "self-consistent but wrong" state.

2. Per-Language Unit Tests

Ten tests, mirrored across all three ports:

#	Name	Asserts
1	`bloom_hit_miss`	Bloom positive case + a definite negative
2	`bounds_short_circuit`	`Get` skips SST when key outside `[smallest, largest]`
3	`range_tomb_hides_older_put`	Newer range tomb shadows older Put
4	`range_tomb_respects_newer_put`	Older range tomb does not shadow newer Put
5	`tiered_picks_prefix`	`compact_size_tiered` picks ≥2 prefix
6	`universal_picks_run`	`compact_universal` picks ≥3 contiguous run
7	`noop_compaction`	Returns `false` when no eligible group
8	`dump_determinism`	Two dumps of the same state are equal; magic is `DSEADV21`
9	`workload_all_scenarios`	All four scenarios produce non-empty dumps with correct magic
10	`dedup_keeps_last`	`build_sst` keeps the last Put per key

./scripts/verify.sh
# == Rust ==
# 10 passed; 0 failed
# == Go ==
# ok      github.com/10xdev/dse/db21
# == C++ ==
# 1/1 Test #1: test_adv .........................   Passed
# === OK ===

3. Cross-Language Byte Equivalence

./scripts/cross_test.sh
# == build Rust ==
# == build Go ==
# == build C++ ==
# ok   fixture=A impl=rust fc2fe88978eb2d419a73a7a16fa9ec0695ad9a56cb3a31b0bf85c0a28d7c97d6
# ok   fixture=A impl=go   fc2fe88978eb2d419a73a7a16fa9ec0695ad9a56cb3a31b0bf85c0a28d7c97d6
# ok   fixture=A impl=cpp  fc2fe88978eb2d419a73a7a16fa9ec0695ad9a56cb3a31b0bf85c0a28d7c97d6
# ok   fixture=B impl=rust 05b07426e0da8ec2f1f8c81573dc275cd61cab9c19c93dc17c854456e441e7bb
# ok   fixture=B impl=go   05b07426e0da8ec2f1f8c81573dc275cd61cab9c19c93dc17c854456e441e7bb
# ok   fixture=B impl=cpp  05b07426e0da8ec2f1f8c81573dc275cd61cab9c19c93dc17c854456e441e7bb
# ok   fixture=C impl=rust 4ad255755dbfbaa40a842766656d0c0dbd6713b6a527ffea5a24fa35964d73e4
# ok   fixture=C impl=go   4ad255755dbfbaa40a842766656d0c0dbd6713b6a527ffea5a24fa35964d73e4
# ok   fixture=C impl=cpp  4ad255755dbfbaa40a842766656d0c0dbd6713b6a527ffea5a24fa35964d73e4
# === ALL OK ===

4. What Would Falsify The Claim

A non-exhaustive list of bugs the cross test would catch but a per-language test wouldn't:

Forgetting to encode the bloom bitmap as little-endian on a big-endian port.
Using host integer width for length prefixes instead of u32.
Iterating a hash map at any point in merge_run (non-deterministic order across languages and across runs).
Encoding the ratio as "0.5" instead of the IEEE bit pattern.
Compacting via "longest run found so far that satisfies threshold at the time of finding", instead of evaluating all runs and picking the global longest.
Off-by-one in b = a + 1 + (r3 mod (keys-a)) for the range tombstone end key.

5. Reproducibility Bar

macOS arm64, AppleClang 16, Go 1.22, Rust stable (rustc 1.7x).
No external dependencies (no sha2 crate, no golang.org/x/..., no OpenSSL): every implementation is self-contained, so the verification step is reproducible offline.
All three hashes are pinned in scripts/cross_test.sh and reproduced in this document for paper-trail purposes.

Broader Ideas — db-21

A short scrapbook of "what would I add next if this were a real engine?"

1. Tombstone Garbage Collection

Right now a range tombstone lives forever: it survives every compaction and is copied verbatim into the merged output. A production engine drops a tomb when it's certain no shadowed Puts remain below it. The standard test: no SST below it overlaps [start_key, end_key), i.e. each one has smallest_key ≥ end_key or largest_key < start_key. Implementing this would require tracking the "sequence number" or generation of each record, which we deliberately omitted.

2. Coalescing Overlapping Tombstones

Two tombs [k0, k5) and [k3, k7) are equivalent to [k0, k7). Merging them at compaction time shrinks the per-Get cost (the active vector stays smaller). We didn't do it because the canonical-bytes test would then need to specify a normalisation policy (sort by start? coalesce overlaps? coalesce adjacencies?). Each choice is fine, but the choice itself becomes part of the wire format.

3. Multi-Level Layout

The lab keeps SSTs as a flat list. RocksDB has L0 (overlapping ranges allowed, newest writes land here) plus L1..Ln (each level non-overlapping, ratio'd in size). Universal compaction roughly corresponds to a degenerate "L0 only" mode, while leveled compaction is its own beast (each compaction picks one L_i SST and the L_{i+1} SSTs that overlap it). A natural follow-up would implement leveled compaction and re-run the same three fixtures with new hashes.

4. Bloom Quality

A 64-bit single-hash bloom is intentionally bad — it exists to make the test for "bloom misses still must walk older SSTs for range tombstones" trigger reliably on tiny inputs. Real engines use per-SST blooms sized to ~10 bits per key with k≈7 hash functions, giving a false-positive rate ~1%. The change is purely numeric; the wire format would absorb a longer bitmap as a length-prefixed byte string.
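
The "~1%" figure follows from the usual bloom approximation, FPR ≈ (1 − e^(−k/b))^k for b bits per key and k hash functions. A tiny sketch to check the arithmetic (bloom_fpr is a hypothetical helper, and the formula assumes independent, uniform hash functions):

```rust
/// Approximate false-positive rate of a bloom filter with `bits_per_key`
/// bits per inserted key and `k` hash functions: (1 - e^(-k / bits_per_key))^k.
pub fn bloom_fpr(bits_per_key: f64, k: u32) -> f64 {
    let k_f = k as f64;
    (1.0 - (-k_f / bits_per_key).exp()).powf(k_f)
}
```

bloom_fpr(10.0, 7) comes out a little above 0.008, which is the "~1%" quoted above; halving the bits per key at the same k pushes the rate up by roughly an order of magnitude.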

5. Snapshot Reads / MVCC

If each entry carried a seq: u64, Get(key, at_seq) would walk SSTs the same way but only consider entries with entry.seq ≤ at_seq. Range tombstones would gain a seq too. The merge step would need to keep older versions until they're below the oldest live snapshot.

6. Why Not Implement These Now?

Each one would multiply the size of the wire format and the surface area of the cross-language tests. The lab's claim is that two ideas (range tombstones, two compaction policies) are enough to stress-test cross-language byte equivalence to the point of being convincing. Adding a third without first writing it down somewhere else would dilute the lesson.

Step 01 — Range Tombstones

Goal

Implement a single record that logically deletes every key in [start, end) without writing one Delete per key, and prove the priority rules with two adversarial tests.

What to Build

A RangeTomb { start_key, end_key } value type with a covers(key) predicate (key >= start && key < end).
An Sst carries a Vec<RangeTomb> alongside its Vec<Entry>.
LsmTreeAdvanced::get walks SSTs newest → oldest, accumulating active tombstones into a local vector as it goes.

The Two Rules That Matter

A range tombstone hides keys older than it — i.e. in SSTs that appear later in the newest-first walk.
A range tombstone does not hide keys newer than it — i.e. in SSTs that appear earlier in the walk.

The Two Tests That Pin Them

range_tomb_hides_older_put: newer SST has tomb [k0, k5), older SST has Put(k3, "hello"). get("k3") must return None.
range_tomb_respects_newer_put: newest SST has Put(k3, "fresh"), middle SST has tomb [k0, k5), oldest SST has Put(k3, "stale"). get("k3") must return Some("fresh").

Subtlety: Bloom Misses

When the bloom of an SST says "key not here", you cannot return early from get. The skipped SST might contribute a range tombstone that would shadow something below. So a bloom miss continues the walk; only a range-tombstone match early-exits with None.

Done When

Both tests above pass in all three languages.
The range_tombstones are present in dump_state per Section 4 of docs/execution.md, and the three canonical fixtures still match.

Step 02 — Tiered and Universal Compaction

Goal

Implement two compaction policies as deterministic functions of the current SST sequence and the configured ratio, so that the same input list of SSTs always produces the same output list.

Size-Tiered

Pick the longest prefix L ∈ [2, n-1] of ssts such that Σ size(ssts[0..L]) ≤ ratio · size(ssts[L]).

chosen     = 0
prefix_sum = 0
for L in 1..=n-1:
    prefix_sum += size(ssts[L-1])
    if L >= 2 and prefix_sum <= ratio * size(ssts[L]):
        chosen = L
if chosen < 2: return false
merged = merge_run(ssts[0..chosen])
ssts   = [merged] ++ ssts[chosen..]
return true

The merged SST goes at the newest position, because that's where the newly-merged data conceptually lives.
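
The selection half of the pseudocode above can be sketched as a pure function over SST sizes. The name pick_tiered_prefix and the sizes-only signature are assumptions chosen so the choice can be tested without building real SSTs.

```rust
/// Return the prefix length to merge, or None if no prefix qualifies.
/// `sizes[i]` is `Sst::size()` of the i-th SST, newest first.
pub fn pick_tiered_prefix(sizes: &[usize], ratio: f64) -> Option<usize> {
    let n = sizes.len();
    let mut chosen = 0;
    let mut prefix_sum = 0usize;
    for l in 1..n {
        // After this line prefix_sum covers sizes[0..l].
        prefix_sum += sizes[l - 1];
        if l >= 2 && (prefix_sum as f64) <= ratio * sizes[l] as f64 {
            chosen = l; // keep scanning: a longer prefix may also qualify
        }
    }
    if chosen >= 2 {
        Some(chosen)
    } else {
        None
    }
}
```

With sizes [1, 1, 10, 1, 1, 100] and ratio 0.5 the result is Some(5), not Some(2): the scan does not stop at the first qualifying prefix.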

Universal

Pick the longest contiguous run [i, i+L) with L ≥ 3 such that Σ size(run) ≤ ratio · size(ssts[i+L]) (i.e. the run must be followed by something at least 1/ratio times its total size). Ties broken by smaller i.

best_i, best_L = -1, 0
for i in 0..n:
    if i + 3 >= n: break
    run_sum = 0
    for L in 1..=n-1-i:
        run_sum += size(ssts[i+L-1])
        follow = i + L
        if follow >= n: break
        if L >= 3 and run_sum <= ratio * size(ssts[follow]):
            if L > best_L: best_i, best_L = i, L
if best_L == 0: return false
merged = merge_run(ssts[best_i..best_i+best_L])
ssts   = ssts[..best_i] ++ [merged] ++ ssts[best_i+best_L..]
return true
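The same selection rule, sketched in Rust over a list of sizes. pick_universal is a hypothetical name, and only the run selection is shown; the merge itself is left out. The strict l > best_len comparison is what breaks ties toward the smaller i.

```rust
// Returns (start, len) of the run to merge, or None when no run qualifies.
// `sizes` holds the SST sizes, newest first.
fn pick_universal(sizes: &[u64], ratio: f64) -> Option<(usize, usize)> {
    let n = sizes.len();
    let mut best: Option<(usize, usize)> = None;
    for i in 0..n {
        // A run of 3 starting at i needs a follower at index i + 3.
        if i + 3 >= n {
            break;
        }
        let mut run_sum = 0u64;
        for l in 1..(n - i) {
            // l runs over 1..=n-1-i, so follow = i + l stays in bounds.
            run_sum += sizes[i + l - 1];
            let follow = i + l;
            let fits = (run_sum as f64) <= ratio * (sizes[follow] as f64);
            if l >= 3 && fits {
                let best_len = best.map_or(0, |(_, len)| len);
                if l > best_len {
                    best = Some((i, l)); // strict '>' keeps the smaller i on ties
                }
            }
        }
    }
    best
}

fn main() {
    // Three small SSTs between two big ones.
    assert_eq!(pick_universal(&[1000, 10, 10, 10, 1000], 1.0), Some((1, 3)));
    // Two equally long candidate runs: the one with the smaller start wins.
    assert_eq!(pick_universal(&[1, 1, 1, 100, 1, 1, 1, 100], 1.0), Some((0, 3)));
}
```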

Merge Semantics (shared by both)

Walk the run newest → oldest:

Append the SST's range tombs to out_tombs.
For each entry:
- skip if seen[key] (newer-wins),
- skip if covered by any tomb in active,
- otherwise emit; mark seen.
Append the SST's range tombs to active (so they apply to older SSTs in the run).

After the walk, sort out_entries by key (for determinism across hash-map iteration orders) and recompute smallest, largest, bloom.
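A minimal sketch of that walk, assuming simplified hypothetical types (Kind, Sst, covered) and leaving out the recomputation of smallest, largest, and bloom. The point to notice is the ordering inside the loop: an SST's own tombstones join active only after its entries were processed, so they shadow older SSTs but not the SST they came from.

```rust
use std::collections::HashSet;

#[derive(Clone, Debug, PartialEq)]
enum Kind {
    Put(String),
    Delete,
}

struct Sst {
    entries: Vec<(String, Kind)>,
    range_tombs: Vec<(String, String)>, // half-open [start, end)
}

fn covered(tombs: &[(String, String)], key: &str) -> bool {
    tombs.iter().any(|(s, e)| s.as_str() <= key && key < e.as_str())
}

// `run` is ordered newest first.
fn merge_run(run: &[Sst]) -> Sst {
    let mut seen: HashSet<String> = HashSet::new();
    let mut active: Vec<(String, String)> = Vec::new();
    let mut out_entries: Vec<(String, Kind)> = Vec::new();
    let mut out_tombs: Vec<(String, String)> = Vec::new();
    for sst in run {
        out_tombs.extend(sst.range_tombs.iter().cloned());
        for (key, kind) in &sst.entries {
            if seen.contains(key) {
                continue; // newer-wins
            }
            if covered(&active, key) {
                continue; // hidden by a tombstone from a newer SST
            }
            seen.insert(key.clone());
            out_entries.push((key.clone(), kind.clone()));
        }
        // Only now do this SST's tombstones start shadowing older SSTs.
        active.extend(sst.range_tombs.iter().cloned());
    }
    out_entries.sort_by(|a, b| a.0.cmp(&b.0)); // deterministic output order
    Sst { entries: out_entries, range_tombs: out_tombs }
}

fn put(k: &str, v: &str) -> (String, Kind) {
    (k.to_string(), Kind::Put(v.to_string()))
}

fn main() {
    let newest = Sst {
        entries: vec![put("d", "same"), put("a", "new")],
        range_tombs: vec![("c".to_string(), "e".to_string())],
    };
    let older = Sst {
        entries: vec![put("a", "old"), ("b".to_string(), Kind::Delete), put("c", "x"), put("z", "keep")],
        range_tombs: vec![],
    };
    let merged = merge_run(&[newest, older]);
    let keys: Vec<&str> = merged.entries.iter().map(|(k, _)| k.as_str()).collect();
    assert_eq!(keys, vec!["a", "b", "d", "z"]);
}
```

In the example, d survives even though it lies inside [c, e), because it sits in the same SST as the tombstone, while the older c is dropped.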

Why the Minimum Lengths

Tiered's L ≥ 2 keeps it from being "merge one SST with nothing", which would just rewrite the SST.
Universal's L ≥ 3 is this lab's choice (RocksDB exposes the equivalent knob as min_merge_width, which defaults to 2); smaller runs are too frequent to amortise the I/O.

Done When

tiered_picks_prefix passes (size-tiered selects the 3-small-SST prefix in front of a big SST and produces a 2-SST result).
universal_picks_run passes (universal selects the 3-small run between two big SSTs).
noop_compaction passes (both policies return false on a 2-SST tree).
All three canonical fixtures still match after this step.

Step 03 — Cross-Language Byte Equivalence

Goal

Prove that the Rust, Go, and C++ implementations produce byte-identical canonical wire dumps for three fixed workloads.

Why This Is The Whole Point

API-level test parity is cheap and weak. "Same input → same hash of a canonical binary dump" is strong: any per-language drift (endianness, integer width, map-iteration order, float formatting) surfaces as a hash mismatch on the next run.

The Format (one canonical source)

See docs/execution.md Section 4. Two-line summary:

Magic "DSEADV21" ‖ f64 LE ratio ‖ u32 LE sst_count.
Per SST (newest first): bounds (lenpref) ‖ entries (u8 kind + lenpref key + maybe lenpref value) ‖ range tombs ‖ u64 LE bloom bitmap.

The Workload (one canonical source)

See docs/execution.md Sections 1-3. Two-line summary:

SplitMix64 PRNG, 3 draws per op, (r1 >> 62) & 3 chooses Put / Put / Delete / RangeTomb.
Flush every 8 ops, compact every 16. No residue flush at end.

The Three Fixtures

Fixture	seed	ops	keys	scenario
A	42	200	32	`tieredcompact`
B	7	500	64	`universalcompact`
C	99	300	16	`withrange`

Hashes are pinned in scripts/cross_test.sh and reproduced in docs/execution.md Section 5 and docs/verification.md Section 3.

Done When

./scripts/cross_test.sh
# ... ends with ...
=== ALL OK ===

If it doesn't, the diff between two implementations' dumps is the debugging artefact. Decode the first ~16 bytes to confirm magic + ratio, then walk SSTs one at a time — each SST is self-delimiting.

What To Do When A Hash Drifts

Recapture from Rust. If you intentionally changed semantics, the Rust reference dictates the new canonical hashes; update both scripts/cross_test.sh and docs/execution.md Section 5.
Hunt the drift. If you didn't intend to change anything, diff the raw dump_state bytes between the failing pair. The first differing byte tells you where in the format the bug lives. Common culprits: forgot LE, used usize instead of u32, iterated a hash map.
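A small helper along these lines can locate the first differing byte and name the header region it falls in. It is a sketch only: first_diff, region, and header are hypothetical names, and the region boundaries come from the two-line format summary above (8-byte magic, f64 LE ratio, u32 LE sst_count), not from the lab's source.

```rust
// Offset of the first byte where the two dumps disagree, if any.
// If one dump is a strict prefix of the other, the shorter length is returned.
fn first_diff(a: &[u8], b: &[u8]) -> Option<usize> {
    let common = a.len().min(b.len());
    match (0..common).find(|&i| a[i] != b[i]) {
        Some(i) => Some(i),
        None if a.len() != b.len() => Some(common),
        None => None,
    }
}

// Maps an offset to the part of the dump it belongs to.
fn region(offset: usize) -> &'static str {
    match offset {
        0..=7 => "magic",
        8..=15 => "ratio (f64 LE)",
        16..=19 => "sst_count (u32 LE)",
        _ => "SST body",
    }
}

// Builds just the 20-byte header, for demonstration.
fn header(ratio: f64, sst_count: u32) -> Vec<u8> {
    let mut out = b"DSEADV21".to_vec();
    out.extend_from_slice(&ratio.to_le_bytes());
    out.extend_from_slice(&sst_count.to_le_bytes());
    out
}

fn main() {
    let a = header(0.5, 3);
    let b = header(0.5, 4);
    if let Some(off) = first_diff(&a, &b) {
        eprintln!("first difference at byte {} ({})", off, region(off));
    }
}
```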

db-22 — Performance and Benchmarking

Why this lab exists

Benchmarks lie. They mostly lie because a benchmark answers a different question than the one you thought you were asking. db-22 is a small, self-contained system whose only purpose is to be measured: a keyed in-memory counter store driven by a deterministic synthetic workload. We freeze a wire format and a workload, hash the resulting state across three implementations (Rust, Go, C++), and use the resulting binary identity as the load-bearing definition of "the same program."

Once correctness is cross-language identical, we can compare throughput of the three implementations on the same hardware honestly — and we can talk about what does and does not constitute a fair comparison.

The data structure: `CounterStore`

A CounterStore is a mapping i64 -> u64 (counters) plus a single total_ops: u64 running counter. Three operations:

incr(k, by): total_ops += 1, counters[k] += by (entry created with value by if missing).
decr(k, by): total_ops += 1. If k is missing the call has no further effect (total_ops was already incremented). Otherwise:
- if current <= by, the entry is removed (saturating decrement);
- else counters[k] = current - by.
get(k) -> Option<u64>: live lookup, returns None if absent.

There are no tombstones. Removed counters leave no trace in the snapshot. This is intentional and matters: it makes the snapshot a pure function of the final live state, not of the history of operations.

The rule that total_ops is bumped on every call (including a no-op decr on a missing key) is the simplest invariant and is the one against which all golden hashes were computed. Changing it would change every hash.
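The semantics above fit in a few lines of Rust. This is a minimal sketch, assuming a BTreeMap-backed store as described for the Rust implementation; the field names and the use of plain += in incr (which would panic on u64 overflow in a debug build) are choices made for this illustration.

```rust
use std::collections::BTreeMap;

#[derive(Default)]
struct CounterStore {
    counters: BTreeMap<i64, u64>,
    total_ops: u64,
}

impl CounterStore {
    fn incr(&mut self, k: i64, by: u64) {
        self.total_ops += 1;
        // Creates the entry at 0 first, so a new key ends up holding `by`.
        *self.counters.entry(k).or_insert(0) += by;
    }

    fn decr(&mut self, k: i64, by: u64) {
        // Bumped unconditionally, even when the key is missing.
        self.total_ops += 1;
        match self.counters.get(&k).copied() {
            None => {}
            // Saturating: decrementing to or past zero removes the entry.
            Some(current) if current <= by => {
                self.counters.remove(&k);
            }
            // Safe subtraction: current > by is guaranteed here.
            Some(current) => {
                self.counters.insert(k, current - by);
            }
        }
    }

    fn get(&self, k: i64) -> Option<u64> {
        self.counters.get(&k).copied()
    }
}

fn main() {
    let mut s = CounterStore::default();
    s.incr(1, 5);
    s.decr(1, 3);
    assert_eq!(s.get(1), Some(2));
    s.decr(1, 100);
    assert_eq!(s.get(1), None);
    assert_eq!(s.total_ops, 3);
}
```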

Wire format: `dump_snapshot`

The snapshot is a function CounterStore -> bytes whose output must be byte-identical across all three implementations.

offset  size  field
------  ----  ---------------------------------------------
0       8     magic "DSEBENCH" (ASCII)
8       8     total_ops (u64 little-endian)
16      4     distinct_keys (u32 little-endian)
20+     16    one row per key, ascending by key:
              - 8 bytes: key (i64 little-endian)
              - 8 bytes: count (u64 little-endian)

Ordering is the only subtle bit. Rust uses BTreeMap, whose iteration is naturally ascending. Go uses a plain map and explicitly sorts the keys before emitting. C++ uses std::map, also ascending. All three converge on the same byte sequence.
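A sketch of the serializer for the layout above, written as a free function over a BTreeMap so it stands alone. The function signature is an assumption made for this example; the lab's dump_snapshot presumably lives on CounterStore.

```rust
use std::collections::BTreeMap;
use std::convert::TryFrom;

fn dump_snapshot(total_ops: u64, counters: &BTreeMap<i64, u64>) -> Vec<u8> {
    let mut out = Vec::with_capacity(20 + 16 * counters.len());
    out.extend_from_slice(b"DSEBENCH");
    out.extend_from_slice(&total_ops.to_le_bytes());
    // The wire field is u32; fail loudly rather than truncate silently.
    let distinct = u32::try_from(counters.len()).expect("more than u32::MAX keys");
    out.extend_from_slice(&distinct.to_le_bytes());
    // BTreeMap iterates in ascending key order, negative keys first.
    for (k, v) in counters {
        out.extend_from_slice(&k.to_le_bytes());
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

fn main() {
    let mut counters = BTreeMap::new();
    counters.insert(2i64, 9u64);
    counters.insert(-1i64, 7u64);
    let bytes = dump_snapshot(3, &counters);
    assert_eq!(bytes.len(), 20 + 2 * 16);
    assert_eq!(&bytes[20..28], &[0xFF; 8]); // key -1 comes first
}
```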

The workload: deterministic by construction

We need a workload that:

Is identical across languages.
Exercises a mix of insert / mutate / delete to produce a non-trivial end state.
Can be scaled in both ops and keys.

We use SplitMix64 for randomness. It is small, fast, has trivially portable arithmetic (just u64 adds, shifts, multiplies, and xors), and needs no library. The constants and step function are well-known:

state += 0x9E3779B97F4A7C15
z = state
z = (z ^ (z >> 30)) * 0xBF58476D1CE4E7B5
z = (z ^ (z >> 27)) * 0x94D049BB133111EB
return z ^ (z >> 31)

Each workload iteration draws exactly three u64 words. Drawing the same number every iteration is what keeps the RNG stream identical across languages even when one branch is short and another is long.

r1, r2, r3 = rng.next(), rng.next(), rng.next()
kind = (r1 >> 62) & 0x3        # 0,1,2 → Incr ; 3 → Decr  (3:1 ratio)
k    = (r2 % keys) as i64
by   = (r3 % 100) + 1          # 1..=100

Three-to-one incr:decr means the counter map grows in expectation, but with keys small relative to ops the map fills up and the decrement path actually deletes entries — both code paths get exercised in any non-trivial run.
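The generator and the three-draw decoding can be sketched in Rust as follows. The Op enum, next_op, and the GAMMA constant name are hypothetical names for this illustration; the constants and the decoding rule are the ones given above. Because every iteration advances the state by exactly three additions of GAMMA, the state after n operations is predictable regardless of which branch each operation took.

```rust
const GAMMA: u64 = 0x9E3779B97F4A7C15;

struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    // All arithmetic wraps modulo 2^64, matching unsigned overflow in Go and C++.
    fn next(&mut self) -> u64 {
        self.state = self.state.wrapping_add(GAMMA);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E7B5);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }
}

#[derive(Debug, PartialEq)]
enum Op {
    Incr { k: i64, by: u64 },
    Decr { k: i64, by: u64 },
}

// Draws exactly three words per call. `keys` must be non-zero.
fn next_op(rng: &mut SplitMix64, keys: u64) -> Op {
    // Tuple fields are evaluated left to right, so r1 is the first draw.
    let (r1, r2, r3) = (rng.next(), rng.next(), rng.next());
    let kind = (r1 >> 62) & 0x3;
    let k = (r2 % keys) as i64; // modulus on u64 first, then a single cast
    let by = (r3 % 100) + 1; // 1..=100
    if kind == 3 {
        Op::Decr { k, by }
    } else {
        Op::Incr { k, by }
    }
}

fn main() {
    let mut rng = SplitMix64::new(42);
    for _ in 0..5 {
        eprintln!("{:?}", next_op(&mut rng, 32));
    }
    // Five ops consumed fifteen words, whatever the branch mix was.
    assert_eq!(rng.state, 42u64.wrapping_add(GAMMA.wrapping_mul(15)));
}
```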

Two frozen scenarios

Scenario	seed	ops	keys	SHA-256 of snapshot
A	42	500	32	`4b72eab6cbc773ac9584104c5923a5139b34ab466052bdb8ceacb087c06a9015`
B	7	5000	256	`5c35e7b1507834fda4960246640e6fb0b194b75b9593bec87159eafcbc3876a1`

scripts/cross_test.sh builds all three binaries and asserts that the hashes match each other and these golden values.

Common cross-language divergence sources (and how we avoid them)

Map iteration order. We never iterate HashMap. We sort keys explicitly (Go) or use BTreeMap/std::map (Rust, C++).
Integer promotion in shifts. u64-only arithmetic. No mixed signed/unsigned shifts.
% semantics for negative operands. r2 is u64; modulus and cast to i64 happen exactly once and in the same order.
size_t width. We only put u32/u64 on the wire, never size_t directly.
Trailing whitespace / newlines in CLI output. hash prints the hex with no trailing newline. bench writes its line to stderr so it can never pollute stdout that a script might be hashing.

Bench methodology

benchctl bench runs a short warm-up (ops/10 + 1) to pull pages and populate caches, then a single timed pass over ops calls. It prints ops, keys, elapsed_us, ops_per_sec, and distinct (the number of live counters at the end) to stderr.

This is intentionally crude: the workload is a single thread doing in-memory map operations. It is good enough for "is the Rust build twice as fast as the Go build?" type questions; it is not a replacement for a microbenchmark framework such as criterion, go test -bench, or Google Benchmark. The references in references.md cover the deeper rabbit hole.
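The shape of such a harness, sketched in Rust under a few assumptions: the warm-up reuses the same seed and runs against a throwaway store, run_workload returns only the number of live counters, and BenchResult is a hypothetical struct. The source fixes the warm-up size, the single timed pass, and the stderr fields; the rest is illustrative.

```rust
use std::collections::BTreeMap;
use std::hint::black_box;
use std::time::Instant;

fn next(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E3779B97F4A7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E7B5);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
    z ^ (z >> 31)
}

// Runs the incr/decr mix and returns the number of live counters.
fn run_workload(seed: u64, ops: u64, keys: u64) -> usize {
    let mut state = seed;
    let mut counters: BTreeMap<i64, u64> = BTreeMap::new();
    for _ in 0..ops {
        let (r1, r2, r3) = (next(&mut state), next(&mut state), next(&mut state));
        let k = (r2 % keys) as i64;
        let by = (r3 % 100) + 1;
        if ((r1 >> 62) & 0x3) == 3 {
            match counters.get(&k).copied() {
                Some(cur) if cur > by => {
                    counters.insert(k, cur - by);
                }
                Some(_) => {
                    counters.remove(&k);
                }
                None => {}
            }
        } else {
            *counters.entry(k).or_insert(0) += by;
        }
    }
    counters.len()
}

struct BenchResult {
    elapsed_us: u128,
    ops_per_sec: f64,
    distinct: usize,
}

fn bench(seed: u64, ops: u64, keys: u64) -> BenchResult {
    // Warm-up: ops/10 + 1 operations, result discarded.
    black_box(run_workload(seed, ops / 10 + 1, keys));
    // Single timed pass. Nothing but the workload sits inside the region.
    let start = Instant::now();
    let distinct = black_box(run_workload(seed, ops, keys));
    let elapsed = start.elapsed();
    let secs = elapsed.as_secs_f64();
    let ops_per_sec = if secs > 0.0 { ops as f64 / secs } else { f64::INFINITY };
    BenchResult { elapsed_us: elapsed.as_micros(), ops_per_sec, distinct }
}

fn main() {
    let (ops, keys) = (100_000u64, 1024u64);
    let r = bench(1, ops, keys);
    // stderr only, so stdout stays clean for anything a script might hash.
    eprintln!(
        "ops={} keys={} elapsed_us={} ops_per_sec={:.0} distinct={}",
        ops, keys, r.elapsed_us, r.ops_per_sec, r.distinct
    );
}
```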

What you actually learn from this lab

Why a benchmark needs to fix a deterministic workload before it fixes a metric.
Why "the same program in two languages" needs a binary equality test, not a "looks the same" code review.
Why bench harnesses must warm up, isolate stdout/stderr, and avoid hidden allocations inside the timed region.
Why HashMap iteration order is a footgun for portable wire formats.

References — db-22

Primary sources on benchmarking

Brendan Gregg. Systems Performance: Enterprise and the Cloud, 2nd ed., Addison-Wesley, 2020. The canonical modern reference. Chapter 12 ("Benchmarking") is required reading; the "active benchmarking" methodology and the catalog of common mistakes (cold-cache effects, the wrong saturation point, the wrong unit) frame the entire lab.
Brendan Gregg. BPF Performance Tools, Addison-Wesley, 2019. Less directly relevant here but the right book if you want to observe what your benchmark is actually doing on a Linux box.
Gil Tene. "How NOT to Measure Latency." Strange Loop 2015. The "coordinated omission" talk. Even on an in-memory benchmark like this one, the principle generalizes: the metric you report has to match the question the user is asking. We intentionally report ops_per_sec, not p99 latency, because a single-threaded synchronous loop does not have an interesting tail.
Bryant & O'Hallaron. Computer Systems: A Programmer's Perspective, 3rd ed., Pearson, 2015. Chapter 5 ("Optimizing Program Performance") and Chapter 9 ("Virtual Memory") supply the "always measure one level deeper" instinct used throughout the docs.

Determinism, RNGs, and reproducible benchmarks

Sebastiano Vigna. "An experimental exploration of Marsaglia's xorshift generators, scrambled." ACM TOMS, 2016. SplitMix64 and friends. Justification for using SplitMix64 here: it has trivially portable arithmetic and a well-defined byte-identical output across languages.
Guy Steele, Doug Lea, Christine Flood. "Fast Splittable Pseudorandom Number Generators." OOPSLA 2014. The paper that introduced SplitMix.

Microbenchmarking pitfalls (per-language)

Andrey Akinshin. Pro .NET Benchmarking, Apress, 2019. Despite the .NET framing, chapters 1–4 are language-agnostic gold: warm-up, steady state, the dead-code-elimination trap, JIT vs AOT timing.
Aleksey Shipilëv. "JMH samples" and his "Nanotrusting the Nanotime" blog post. Java-specific but the lessons are universal — particularly the discussion of System.nanoTime resolution traps, which apply equally to std::chrono::steady_clock and Go's time.Now().
Rust: criterion documentation, especially the section on outlier detection.
Go: the testing package's Benchmark docs and Dave Cheney's "Five things that make Go fast".
C++: Google Benchmark and Chandler Carruth's CppCon talk "Tuning C++".

Cross-language byte-equality engineering

The Cap'n Proto encoding spec. A worked example of a wire format designed for cross-language stability. We do not use Cap'n Proto here, but its constraints (fixed-width little-endian, no sentinel ordering ambiguity, no implicit string normalization) are the same constraints we impose on dump_snapshot.
Go issue #7986 — map iteration is intentionally randomized. Read the issue and the surrounding discussion; this is the canonical worked example of why a portable wire format may never iterate a hash map without an explicit sort.

Background reading on what "fast" means

Latency Numbers Every Programmer Should Know (the Peter Norvig / Jeff Dean table). Internalize the ratios. The point of the bench harness is to put your numbers somewhere on this chart.
Ulrich Drepper. "What Every Programmer Should Know About Memory." Long and old but still the right tour of the memory hierarchy your bench is actually hitting.

Analysis — db-22

What we are actually trying to do

The brief was "performance and benchmarking." That is a topic, not a problem statement. The first design pass turned it into a problem statement:

Build a tiny system that has one knob (a deterministic workload) and one measurable property (throughput on that workload), then implement it in three languages and use byte-identical correctness as the precondition for any speed claim.

Everything else in the lab follows from that constraint.

Constraints I started with

Three languages must produce the same bytes for the same inputs. Without this, "Rust is faster than Go on this workload" is unfalsifiable — they might just be doing different work.
No external dependencies for the core data structure. SHA-256 has to be reimplementable from scratch in C++ (no OpenSSL), and SplitMix64 has to be reimplemented in all three. This is the same discipline used in db-15 and is the only way to guarantee bit-identity.
The workload must be small enough to brute-force-test for determinism, but large enough that a 1% difference in implementation efficiency shows up in the bench numbers. The two scenarios (500 ops / 32 keys and 5000 ops / 256 keys) bracket this.
The bench harness must not contaminate the correctness harness. Throughput numbers go to stderr; the hex hash goes to stdout with no trailing newline. A shell script can $(benchctl hash ...) cleanly.

Data structure choice: counter store, not KV store

I considered an MVCC KV store (like db-15), a small B-tree, or even reusing db-20's distributed KV. All three are overkill for what we want to measure here. An i64 -> u64 counter store:

Is small enough to fit in ~400 LOC per language.
Is big enough to exercise the map implementations of each language (BTreeMap, map, std::map).
Has interesting cross-language failure modes (HashMap iteration order, signed/unsigned subtraction, integer-width truncation in serialization).
Has a workload that genuinely creates and destroys entries, so the map's resize / rebalance / erase code paths all execute.

The saturating-decrement decision

The choice about what decr(k, by) does when by > current or when k is missing is the most consequential semantic decision in the lab. I considered three options:

No-op on missing, total_ops unchanged. Cleaner in some ways but makes total_ops a partial counter: you cannot replay the operation stream and recover the same total_ops. Rejected.
Underflow / panic on negative. Would force the workload generator to remember which keys are live, defeating the determinism. Rejected.
Saturating: bump total_ops, then either remove the entry or subtract. total_ops always tracks the operation stream. Snapshots are pure functions of the final state, not the operation history. This is what we picked.

The cost is that "decrement past zero" is silently lossy. For a benchmark workload that is the right trade.

Why three RNG draws per iteration

A subtle correctness footgun: if some branches consume fewer RNG words than others, the RNG stream diverges from a different implementation that happens to evaluate the branches in a different order. Drawing all three words before branching means every iteration consumes the same number of RNG bytes regardless of kind. This makes the workload trivially portable.

SplitMix64 over xoshiro / pcg / etc.

SplitMix64 has the smallest state (one u64) and the simplest step (one add, two multiplies, three xors, three shifts). Its only operations are 64-bit integer ops that all three languages handle identically with no surprises. Anything fancier is a footgun for cross-language byte-equality with no upside for a synthetic workload.

Wire format design notes

Little-endian everywhere. ASCII magic so a hexdump -C is human-readable. Length prefix (distinct_keys) so a reader could in principle parse the snapshot incrementally — we never actually do this in the lab, but the format is forward-compatible.

We do not embed keys or ops or seed in the snapshot. The snapshot is purely about the resulting state; the workload that produced it is metadata.
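Since the lab never parses a snapshot, the following reader is purely illustrative: parse_snapshot and u64_at are hypothetical names, and the strictness checks (exact length, strictly ascending keys) are choices made for this sketch rather than documented behavior. It shows how the count prefix lets a reader validate the total length before touching any row.

```rust
// Reads a little-endian u64 at `off`. The caller guarantees the bounds.
fn u64_at(b: &[u8], off: usize) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&b[off..off + 8]);
    u64::from_le_bytes(buf)
}

// Returns (total_ops, rows) or a description of the first problem found.
fn parse_snapshot(bytes: &[u8]) -> Result<(u64, Vec<(i64, u64)>), String> {
    if bytes.len() < 20 {
        return Err("truncated header".to_string());
    }
    if &bytes[..8] != b"DSEBENCH" {
        return Err("bad magic".to_string());
    }
    let total_ops = u64_at(bytes, 8);
    let mut c = [0u8; 4];
    c.copy_from_slice(&bytes[16..20]);
    let count = u32::from_le_bytes(c) as usize;
    // The count prefix fixes the total size: no padding, no trailing bytes.
    if bytes.len() != 20 + 16 * count {
        return Err(format!("expected {} rows, length is {}", count, bytes.len()));
    }
    let mut rows: Vec<(i64, u64)> = Vec::with_capacity(count);
    for i in 0..count {
        let off = 20 + 16 * i;
        let key = u64_at(bytes, off) as i64; // same bits, reinterpreted as signed
        let val = u64_at(bytes, off + 8);
        if let Some(&(prev, _)) = rows.last() {
            if key <= prev {
                return Err("keys not strictly ascending".to_string());
            }
        }
        rows.push((key, val));
    }
    Ok((total_ops, rows))
}

fn main() {
    let mut bytes = b"DSEBENCH".to_vec();
    bytes.extend_from_slice(&3u64.to_le_bytes());
    bytes.extend_from_slice(&1u32.to_le_bytes());
    bytes.extend_from_slice(&(-5i64).to_le_bytes());
    bytes.extend_from_slice(&9u64.to_le_bytes());
    assert_eq!(parse_snapshot(&bytes), Ok((3, vec![(-5, 9)])));
}
```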

Bench harness design

Four decisions:

One pass, one timing region. No statistical machinery. The exercises that need percentiles or distributions go to criterion / go test -bench / Google Benchmark — not this harness.
Warm-up sized as ops/10 + 1. A small warm-up proportional to the run (+ 1 keeps it non-empty when ops < 10) that pulls cache lines and triggers allocator first-touch. Empirically this stabilizes the second-pass timing to within a few percent run-to-run.
Bench output to stderr. Lets benchctl bench and benchctl hash use the same flag layout and lets shell scripts redirect them independently.
distinct in the output. It's a sanity check: if the bench reports distinct=0, your workload is collapsing entries faster than it creates them and your throughput number is measuring deletes, not inserts. (See observation.md for the actual numbers.)

What I'd do differently with more time

Add a third scenario that intentionally has heavy contention on a single key (small keys, large ops) to make the bench numbers more sensitive to allocator behavior.
Wire the bench harness to also produce a flamegraph hint (elapsed_us bucketed per operation kind).
Add a --profile flag that runs the workload twice and reports the ratio, as a cheap "is this stable?" check.

Execution — db-22

How to run everything

# from db-22-performance-and-benchmarking/
bash scripts/verify.sh        # runs Rust, Go, and C++ unit tests
bash scripts/cross_test.sh    # builds 3 binaries, asserts byte-identical hashes

Both scripts end with === OK === or === ALL OK === respectively. They exit non-zero on any mismatch.

Per-language invocation

Rust

cd src/rust
cargo test --release --lib tests              # 9 tests
cargo build --release
./target/release/benchctl hash workload --seed 42 --ops 500 --keys 32 --scenario default
./target/release/benchctl bench workload --seed 1 --ops 100000 --keys 1024 --scenario default

The --release profile is important: the debug build of SplitMix64 is substantially slower because the multiplies aren't inlined.

Go

cd src/go
go test ./...                                 # 9 tests
go build -o /tmp/benchctl_go ./cmd/benchctl
/tmp/benchctl_go hash workload --seed 42 --ops 500 --keys 32 --scenario default
/tmp/benchctl_go bench workload --seed 1 --ops 100000 --keys 1024 --scenario default

C++

cd src/cpp
mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build .
./test_db22                                   # 9 tests
./benchctl hash workload --seed 42 --ops 500 --keys 32 --scenario default
./benchctl bench workload --seed 1 --ops 100000 --keys 1024 --scenario default

The CMake file defaults to Release with -O3 -DNDEBUG. The test binary #undefs NDEBUG so its assertions are not stripped.

What the scripts do, step by step

`scripts/verify.sh`

cd into the lab root.
Run the Rust unit tests under cargo test --release --lib tests. We pass --lib tests so cargo only loads the test module from the library crate; without it cargo prints "0 tests" because it tries to discover integration test binaries that don't exist.
Run the Go unit tests with go test ./....
Configure and build the C++ project under src/cpp/build and run ./test_db22.
Print === OK === on success.

`scripts/cross_test.sh`

Build all three release binaries.
For each of two frozen scenarios, run benchctl hash workload … in all three implementations, capture stdout (no trailing newline), and compare:
- The three implementations must agree with each other.
- They must agree with the golden hash committed in the script.
Print === ALL OK === on success.

CLI shape

benchctl hash  workload --seed N --ops N --keys N --scenario S
benchctl bench workload --seed N --ops N --keys N --scenario S

hash prints the SHA-256 hex digest of the final snapshot, with no trailing newline, on stdout.
bench writes one line to stderr describing the run; its stdout is empty.

Both commands accept identical flags. --scenario is currently a documentation tag — it does not change behavior but is reserved for future workload variants.

Reproducing the frozen hashes

$ ./target/release/benchctl hash workload --seed 42 --ops 500 --keys 32 --scenario default
4b72eab6cbc773ac9584104c5923a5139b34ab466052bdb8ceacb087c06a9015

$ ./target/release/benchctl hash workload --seed 7 --ops 5000 --keys 256 --scenario default
5c35e7b1507834fda4960246640e6fb0b194b75b9593bec87159eafcbc3876a1

If you ever see a different hash:

Did you change MAGIC, the wire format, the workload mixing rule, or SplitMix64? Any of those will move every hash.
Did you change the decrement semantics? See analysis.md.
Are you iterating a HashMap or unordered_map instead of a sorted structure? That will give you a random hash run to run.

Observation — db-22

Cross-language hash check

All three implementations agree on the bytes:

=== scenario A ===
rust: 4b72eab6cbc773ac9584104c5923a5139b34ab466052bdb8ceacb087c06a9015
go  : 4b72eab6cbc773ac9584104c5923a5139b34ab466052bdb8ceacb087c06a9015
cpp : 4b72eab6cbc773ac9584104c5923a5139b34ab466052bdb8ceacb087c06a9015
match + golden ok
=== scenario B ===
rust: 5c35e7b1507834fda4960246640e6fb0b194b75b9593bec87159eafcbc3876a1
go  : 5c35e7b1507834fda4960246640e6fb0b194b75b9593bec87159eafcbc3876a1
cpp : 5c35e7b1507834fda4960246640e6fb0b194b75b9593bec87159eafcbc3876a1
match + golden ok

Throughput probe (single representative run)

ops=100000 keys=1024 elapsed_us=7242 ops_per_sec=13806910 distinct=1024

About 13.8 million ops/sec for the Rust release build on a single thread, single core, no contention, on an Apple Silicon laptop. distinct=1024 tells us the map is fully populated at the end of the run — the increment-heavy mix means decrements rarely empty a slot at this keys cardinality.

Read this as: each op costs roughly 70 nanoseconds, of which a chunk is three SplitMix64 draws, a couple of map lookups, and the per-iteration loop overhead. It is in the right ballpark for an in-memory BTreeMap<i64, u64> workload.

What we are not measuring (and why that matters)

No allocator pressure beyond the initial map growth. The map reaches steady state after ~keys distinct entries are touched, and the rest of the run is in-place mutation.
No I/O, no syscalls, no real memory pressure. The whole working set fits in L2.
No latency distribution. We report a single throughput number. For a single-threaded synchronous loop, p99 latency would just be a rephrasing of throughput plus a small jitter from the OS scheduler.
No cross-language throughput numbers in this doc. You can collect them yourself with benchctl bench — but be honest about what you've measured (one machine, one moment, one workload).

Why the bench number is stable but not authoritative

The bench subcommand runs a small warm-up pass (ops/10 + 1) before the timed pass. At ops = 100,000 the warm-up is 10,001 operations, which is enough to pull all the map slots and K256 SHA constants into the right caches. Without the warm-up the first pass is ~30% slower; with the warm-up, second-pass timings repeat to within a few percent run-to-run.

This is still a crude harness. We are not collecting CPU counters, we are not pinning to a CPU, we are not disabling turbo, we are not controlling for thermal state. Use these numbers for ordering ("did this change make it faster or slower?") and not for absolute claims ("Rust does N nanoseconds per op on this machine").

Sanity checks that fire if you break things

scenario_a_frozen / scenario_b_frozen — any change to wire format, mixing rule, or RNG step breaks both of these immediately.
splitmix64_known — guards against accidental constant-swap in the SplitMix64 mixing function.
sha256_vectors — guards against accidental damage to the SHA implementation in any language.
snapshot_layout_two_keys — pins the exact byte layout of a trivial 2-key snapshot, so a wire-format change shows a tightly localized failure (not just "scenario A differs").
workload_determinism — same seed/ops/keys gives the same bytes on two consecutive runs.

Verification — db-22

What "verified" means here

For a perf-and-bench lab, "verified" means three things at once:

All three implementations pass their own unit tests (Rust 9, Go 9, C++ 9).
All three implementations produce byte-identical snapshot hashes for both frozen scenarios.
The frozen hashes match the golden values committed in source.

Anything less and the bench numbers are meaningless. You can't claim "Rust does X ops/sec on this workload" if it is not doing the same work as the Go and C++ versions.

How to verify

bash scripts/verify.sh
bash scripts/cross_test.sh

Each script exits non-zero on failure and prints either === OK === or === ALL OK === on success.

Expected last lines:

$ bash scripts/verify.sh
…
=== OK ===

$ bash scripts/cross_test.sh
…
=== ALL OK ===

What each unit test pins

Test	Pins
`sha256_vectors`	SHA-256 against known empty and "abc" vectors
`splitmix64_known`	`splitmix64(0) == 0x8b57dafca0cee644`
`incr_accumulates`	`incr` adds to existing entries, creates new ones, bumps `total_ops`
`decr_saturates_and_removes`	decrement past zero removes the entry
`decr_on_missing_is_visible_op`	decr on a missing key bumps `total_ops` but does not create the entry
`snapshot_layout_two_keys`	exact wire bytes of a 2-key snapshot
`workload_determinism`	same seed twice → same snapshot bytes
`scenario_a_frozen` / `scenario_b_frozen`	frozen golden hashes per scenario

The frozen-scenario tests are the highest-value tests in the lab. Any silent change to the wire format, the workload, or SplitMix64 breaks both of them with a clear "got X, want Y" message in the failing language's test output.

Manual sanity checks

# hash of the smallest meaningful snapshot
./target/release/benchctl hash workload --seed 0 --ops 0 --keys 1 --scenario default
# expected: sha256 of MAGIC || 0_u64 || 0_u32 = the empty-store hash

# determinism
./target/release/benchctl hash workload --seed 42 --ops 500 --keys 32 --scenario default
./target/release/benchctl hash workload --seed 42 --ops 500 --keys 32 --scenario default
# should print the same hex twice

What is not verified by these tests

That bench reports the correct throughput. It is impossible to verify a wall-clock number from a test. The bench harness has a distinct= field as a structural sanity check, but the numeric throughput is left to the operator to inspect.
That the implementations are equally fast — we only check they are equally correct. The whole point of the lab is to make speed comparisons honest by first making correctness identical.
That the implementations would still match on a 32-bit or big-endian platform. The wire format pins little-endian; on a hypothetical big-endian build we'd need a byte-swap in put_u64_le etc.

Broader Ideas — db-22

The lab as it stands is a deliberately minimal harness. These are extensions that would build naturally on top of it.

A. Percentile-aware bench harness

Replace the single-pass timer with a per-operation timing loop that collects a histogram (HDR-style) of per-op latencies. Then bench reports p50 / p90 / p99 / p99.9 in addition to throughput. This is where the Gil Tene "How NOT to Measure Latency" talk earns its keep — even on a synchronous single-thread loop, a long-tail GC pause in Go or a page fault in C++ will move the tail dramatically.

Trap to avoid: the cost of taking a timestamp per op (time.Now() / std::chrono::steady_clock::now()) is itself ~30 ns on most boxes, which is comparable to one workload op. You'd need to time batches of ops and divide.
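One way to sidestep that trap is sketched below: take one timestamp pair per batch and report the mean cost per op within each batch. time_batches and percentile are hypothetical helpers, and the nearest-rank percentile is a simple stand-in for an HDR-style histogram. Percentiles computed this way describe batch means, so a single slow op inside a batch is averaged down rather than reported at full size.

```rust
use std::hint::black_box;
use std::time::Instant;

// Times `batches` batches of `batch_size` calls and returns the mean
// nanoseconds per op for each batch. `batch_size` must be non-zero.
fn time_batches<F: FnMut()>(mut op: F, batches: usize, batch_size: usize) -> Vec<f64> {
    let mut per_op_ns = Vec::with_capacity(batches);
    for _ in 0..batches {
        let start = Instant::now(); // one timestamp pair per batch, not per op
        for _ in 0..batch_size {
            op();
        }
        per_op_ns.push(start.elapsed().as_nanos() as f64 / batch_size as f64);
    }
    per_op_ns
}

// Nearest-rank percentile over an ascending, non-empty slice; p in [0, 1].
fn percentile(sorted: &[f64], p: f64) -> f64 {
    assert!(!sorted.is_empty() && (0.0..=1.0).contains(&p));
    let rank = (p * sorted.len() as f64).ceil() as usize;
    sorted[rank.max(1) - 1]
}

fn main() {
    let mut acc = 0u64;
    let mut samples = time_batches(
        || acc = black_box(acc.wrapping_add(0x9E3779B97F4A7C15)),
        200,
        1_000,
    );
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    eprintln!(
        "p50={:.2}ns p99={:.2}ns (per-op means over batches of 1000)",
        percentile(&samples, 0.50),
        percentile(&samples, 0.99)
    );
}
```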

B. Allocator pressure scenario

Add a third scenario whose workload is deliberately allocator-heavy: short-lived strings as values (move from u64 to String), or a churn pattern that constantly creates and removes keys so the map is forced to resize. The cross-language throughput delta for this scenario would be much larger than for the existing one, and the results would speak to the maturity of each language's allocator.

C. Multi-threaded variant

Wrap CounterStore in a sync primitive and run N workers. The point is not to demonstrate scaling (Mutex<BTreeMap<…>> won't scale) but to demonstrate the difference between coarse locking, sharded locking, and lock-free updates. Each language has different idioms here (parking_lot vs std::sync, sync.Map vs atomic, std::shared_mutex vs std::atomic), and the cross-language comparison becomes a language-features comparison.

D. Snapshot replay / log shipping

Right now dump_snapshot produces bytes that are only used for hashing. Add a restore_snapshot and a small "log" of operations (just the sequence of (op, k, by) triples), and you have a tiny replicated store. Connect three nodes via a deterministic schedule and you have a toy version of db-23.

E. Energy and non-time metrics

On Apple Silicon, powermetrics --samplers cpu_power can give you energy per op. The relative energy of the Rust / Go / C++ implementations on the same workload is a more honest "which is more efficient" claim than throughput, because it folds in stalls, branch mispredictions, and memory bandwidth.

F. Comparison with off-the-shelf benchmark frameworks

Run the same workload under criterion (Rust), go test -bench, and Google Benchmark (C++). Compare:

Their reported throughput vs ours.
Their reported variance.
The shape of their output.

The lab's homegrown harness will look crude in comparison, and that's the point — the exercise of measuring the difference is more educational than the difference itself.

G. Worst-case scenario discovery

Use coverage-guided fuzzing on the workload generator (with the saturating-decrement invariant as the asserted property) to find a seed/ops/keys combination that maximizes either throughput or memory pressure. This connects perf work to the fuzz/property-test discipline used in db-13 and db-15.

H. Cross-architecture verification

Run the existing scripts/cross_test.sh under qemu-user-static for aarch64 / x86_64 / riscv64 and confirm the hashes still match. They should — the wire format is little-endian and the arithmetic is all 64-bit — but the only way to be sure is to actually do it.

I. Cache-aware redesign of `CounterStore`

std::map and BTreeMap are pointer-based tree structures, and the Go version keeps a hash map and pays for an explicit key sort at snapshot time. A flat sorted array with binary search would be slower for insert but dramatically faster for the iteration step (which is the critical path in dump_snapshot). For a workload that touches each key only a handful of times before snapshotting, the array would be worth measuring.

J. The "ten percent rule"

A small operational rule we picked up doing this lab: any perf change worth claiming must move the bench number by more than ten percent. Below that, run-to-run noise on a laptop dominates. Above that, you can usually attribute the change to a specific code path. The harness is deliberately not precise enough to defend a 2% claim, and that's a feature.

Step 01 — Counter Store

Goal

Implement a CounterStore in each of three languages with byte-identical semantics for incr, decr, and get. The data structure is intentionally small — three operations, two pieces of state — so we can focus on the edge cases that make cross-language byte-identity hard.

What to build

A type/struct/class CounterStore with:

An ordered map i64 -> u64 (BTreeMap, sorted-keys map, std::map).
A u64 running counter total_ops.
incr(k, by): total_ops += 1; add by to (or create) counters[k].
decr(k, by): total_ops += 1; if k is missing, stop. Otherwise remove the entry if by >= current, else subtract.
get(k) -> Option<u64> / (u64, bool) / std::optional<u64>.

Tests this step should pass

incr_accumulates: three incrs across two keys leave the right per-key values and total_ops == 3.
decr_saturates_and_removes: incr(1, 5); decr(1, 3); decr(1, 100) leaves the map empty with total_ops == 3.
decr_on_missing_is_visible_op: decr(42, 1) on an empty store leaves total_ops == 1 and no entry for 42.

Things to watch for

u64 underflow: never compute current - by before checking by >= current (the same test that decides removal).
Go's map: a missing key reads back as the zero value with ok=false. Use the comma-ok form explicitly.
C++ std::map::operator[]: avoid it on the read path — it inserts a zero entry as a side effect. Use find.

Acceptance

cargo test --release --lib tests::incr_accumulates and the matching Go / C++ tests all pass.

Step 02 — Snapshot and Workload

Goal

Pin a wire format for CounterStore and a deterministic workload generator so that, given identical (seed, ops, keys), all three implementations produce the same bytes — and therefore the same SHA-256 digest.

What to build

`dump_snapshot`

A byte serializer with this exact layout:

"DSEBENCH"  (8 bytes, ASCII)
total_ops   (u64 little-endian)
distinct_keys (u32 little-endian)
for each key in ascending order:
    key (i64 little-endian)
    count (u64 little-endian)

Critical details:

Ascending iteration order. BTreeMap / std::map are already sorted; Go must call sort.Slice on the keys explicitly.
Little-endian for every integer.
No padding, no separators, no trailing bytes.
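A minimal Go sketch of this layout, assuming Go 1.19 or later for the Append helpers in encoding/binary. The function takes the two pieces of state directly so the block stays self-contained; the name dumpSnapshot and its signature are illustrative.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"sort"
)

// dumpSnapshot writes magic, total_ops, distinct_keys, then one
// (key, count) row per key in ascending key order. Every integer is
// little-endian and there is no padding.
func dumpSnapshot(totalOps uint64, counters map[int64]uint64) []byte {
	// Map iteration order is unspecified in Go, so collect and sort first.
	keys := make([]int64, 0, len(counters))
	for k := range counters {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool { return keys[i] < keys[j] })

	le := binary.LittleEndian
	out := make([]byte, 0, 20+16*len(keys))
	out = append(out, "DSEBENCH"...)
	out = le.AppendUint64(out, totalOps)
	out = le.AppendUint32(out, uint32(len(keys)))
	for _, k := range keys {
		out = le.AppendUint64(out, uint64(k)) // two's-complement bits of the i64
		out = le.AppendUint64(out, counters[k])
	}
	return out
}

func main() {
	b := dumpSnapshot(2, map[int64]uint64{2: 1, 1: 1})
	fmt.Println(len(b)) // 52
}
```

The 52 comes from 8 (magic) + 8 (total_ops) + 4 (distinct_keys) + 2 rows of 16 bytes.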

SplitMix64

Implement the standard one-state-word SplitMix64:

state += 0x9E3779B97F4A7C15
z = state
z = (z ^ (z >> 30)) * 0xBF58476D1CE4E7B5
z = (z ^ (z >> 27)) * 0x94D049BB133111EB
return z ^ (z >> 31)

Also implement the stateless splitmix64(x) (without the state += step) for the canonical test vector check.

`run_workload(seed, ops, keys, scenario)`

rng = SplitMix64(seed)
store = empty CounterStore
repeat ops times:
    r1 = rng.next()
    r2 = rng.next()
    r3 = rng.next()
    kind = (r1 >> 62) & 0x3       # 0,1,2 → incr, 3 → decr
    k    = i64(r2 % keys)
    by   = (r3 % 100) + 1
    if kind == 3 -> store.decr(k, by) else store.incr(k, by)
return store.dump_snapshot()

The scenario argument is reserved and ignored for now.
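The decoding step of the loop can be isolated and tested on its own. The sketch below, with the hypothetical names op and genOps, produces the op sequence without touching a store; it assumes keys > 0, since the pseudocode takes r2 % keys.

```go
package main

import "fmt"

type splitMix64 struct{ state uint64 }

func (r *splitMix64) next() uint64 {
	r.state += 0x9E3779B97F4A7C15
	z := r.state
	z = (z ^ (z >> 30)) * 0xBF58476D1CE4E7B5
	z = (z ^ (z >> 27)) * 0x94D049BB133111EB
	return z ^ (z >> 31)
}

// op is one decoded workload step.
type op struct {
	decr bool
	key  int64
	by   uint64
}

// genOps draws exactly three words per op, in a fixed order, and decodes
// them as in the pseudocode: the top two bits of r1 choose the kind
// (3 means decr), r2 picks the key, r3 picks an amount in 1..100.
func genOps(seed, ops, keys uint64) []op {
	rng := &splitMix64{state: seed}
	out := make([]op, 0, ops)
	for i := uint64(0); i < ops; i++ {
		r1, r2, r3 := rng.next(), rng.next(), rng.next()
		out = append(out, op{
			decr: (r1>>62)&0x3 == 3,
			key:  int64(r2 % keys),
			by:   r3%100 + 1,
		})
	}
	return out
}

func main() {
	decrs := 0
	for _, o := range genOps(1, 1000, 16) {
		if o.decr {
			decrs++
		}
	}
	fmt.Printf("decr ops: %d of 1000\n", decrs)
}
```

Because each of the four kind values is equally likely, roughly a quarter of the ops should be decrements; a port that reports a very different share has probably decoded kind from the wrong bits.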

Tests this step should pass

sha256_vectors: empty and "abc" SHA-256 vectors.
splitmix64_known: splitmix64(0) == 0x8b57dafca0cee644.
snapshot_layout_two_keys: incr keys 2 and 1, snapshot is 52 bytes with magic, total_ops=2, distinct_keys=2, then the row for key 1 before the row for key 2.
workload_determinism: two runs of the same workload produce byte-identical snapshots.
scenario_a_frozen / scenario_b_frozen: hashes match the golden values in CONCEPTS.md.

Things to watch for

Always draw three RNG words per iteration, even if a branch only needs two. The RNG stream must be identical across languages.
Never iterate a hash map for serialization. Sort first.
Don't put size_t or usize on the wire — always serialize as u32 or u64.

Acceptance

scripts/cross_test.sh reports === ALL OK ===.

Step 03 — Bench Harness

Goal

Add a bench subcommand to benchctl in each language that runs the same workload as the hash subcommand and reports a throughput number. The harness should be small enough to read end-to-end but disciplined enough not to lie.

What to build

A bench workload --seed N --ops N --keys N --scenario S subcommand that:

Runs a warm-up pass of ops/10 + 1 operations and discards the result.
Captures a high-resolution start timestamp.
Runs the full ops workload and keeps the resulting CounterStore so we can read distinct from it.
Captures a high-resolution end timestamp.
Writes one line to stderr in this format:

ops=<N> keys=<N> elapsed_us=<N> ops_per_sec=<N> distinct=<N>

Writes nothing to stdout.

The CLI's hash subcommand must remain unchanged: stdout-only, no trailing newline, no diagnostic noise.

Timing primitives by language

Rust: std::time::Instant.
Go: time.Now() / time.Since().
C++: std::chrono::steady_clock.

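steady_clock / Instant are the right choice: they are monotonic and not subject to wall-clock adjustments mid-run. In Go, time.Now() also records a monotonic reading and time.Since() uses it, so the same guarantee holds there.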
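A minimal Go sketch of the harness shape described above. The workload is passed in as a function so the block stays self-contained; bench, benchLine, and the stand-in workload are hypothetical names, and the ops_per_sec formula (ops scaled by elapsed microseconds, integer division) is an assumption, since the text fixes only the field names.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// benchLine formats the single stderr line. The ops_per_sec formula is
// an assumption of this sketch.
func benchLine(ops, keys uint64, elapsed time.Duration, distinct int) string {
	us := uint64(elapsed.Microseconds())
	div := us
	if div == 0 {
		div = 1 // avoid dividing by zero on very short runs
	}
	return fmt.Sprintf("ops=%d keys=%d elapsed_us=%d ops_per_sec=%d distinct=%d",
		ops, keys, us, ops*1_000_000/div, distinct)
}

// bench runs a discarded warm-up pass, then times one full pass with a
// single pair of timestamps. No formatting happens inside the timed region.
func bench(ops, keys uint64, run func(ops uint64) int) string {
	_ = run(ops/10 + 1) // warm-up
	start := time.Now()
	distinct := run(ops)
	elapsed := time.Since(start) // uses the monotonic reading
	return benchLine(ops, keys, elapsed, distinct)
}

func main() {
	// Stand-in workload: touches a map so the timed region is not empty.
	work := func(n uint64) int {
		m := make(map[uint64]uint64)
		for i := uint64(0); i < n; i++ {
			m[i%1024]++
		}
		return len(m)
	}
	fmt.Fprintln(os.Stderr, bench(100000, 1024, work)) // stdout stays empty
}
```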

Tests this step should pass

There are no automated tests for bench (timing values can't be asserted), but the structural sanity check is:

./target/release/benchctl bench workload --seed 1 --ops 100000 --keys 1024 --scenario default
# expect on stderr:
# ops=100000 keys=1024 elapsed_us=<some number> ops_per_sec=<some number> distinct=1024
# expect on stdout: nothing

Things to watch for

Don't put printf inside the timed region. Formatting and allocating a string costs on the order of hundreds of nanoseconds, which will dominate small workloads.
Don't take a timestamp per op. The cost of Now() is comparable to the cost of one workload op.
Don't forget the warm-up. The first pass is dominated by cold-cache effects and first-touch allocator behavior.
Don't claim numbers across machines without describing the machine.

Acceptance

Running bench against a 100k-op, 1024-key workload produces a throughput line on stderr and an empty stdout. verify.sh and cross_test.sh continue to pass.

db-23 — Capstone: distributed replicated KV database

This is the final lab. It synthesizes everything from db-01 through db-22 into a single tiny but real distributed key/value database whose state is byte-identical across Rust, Go, and C++ for two frozen scenarios.

What this lab builds

A 3-node replicated KV cluster:

Node	Role
0	Leader. The only node that originates writes.
1	Follower. Can be taken down mid-run.
2	Follower. Always up.

Each write Op (Put or Del) is:

Drawn deterministically from a SplitMix64 stream (see db-04, db-22).
Appended to the leader's log at index log.len() + 1.
Replicated synchronously to every live follower.
Counted as ack'd by every live node whose log already contains that index (plus the leader itself).
Committed on every reachable node when the ack count reaches a majority of 3 (= 2).
Applied: each newly-committed entry mutates the local BTreeMap<i64, i64> state machine in commit-index order.

A catch_up operation lets a recovering follower copy any missing log entries from the leader and advance its commit/apply watermark.

Two scenarios — frozen hashes

The cluster snapshot is the canonical encoding of all three nodes concatenated. We hash it with SHA-256.

Scenario	seed	ops	keys	fault?	SHA-256
normal	42	200	16	no	`5976b45b9f40f440e8249da27fe4fe752e005f606efc3596bdb25ca4e4f99296`
fault	7	2000	128	follower 1 down on `[ops/2, 3·ops/4)`	`d67c36725af65242e985a308db5152af2a3e2525fab33d11ed6e826a252ff792`

Both hashes are frozen as constants in src/rust/src/lib.rs, src/go/db23_test.go, and src/cpp/src/db23.h, and cross-checked by scripts/cross_test.sh.

Deterministic workload

For op i the RNG draws three u64s regardless of branch outcome, so the stream is identical no matter which kind of op gets generated:

r1, r2, r3 = rng.next(), rng.next(), rng.next()
kind       = (r1 >> 62) & 0x3   // 0,1,2 -> Put,  3 -> Del
k          = i64(r2 % keys)
v          = i64(r3 % 1000)

The fault schedule is purely a function of ops:

down_start = ops / 2
down_end   = (ops * 3) / 4

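At i == down_start follower 1 is marked down; at i == down_end it comes back up and we immediately catch_up. Because down_end = (ops * 3) / 4 is strictly less than ops for any ops >= 1, the in-loop recovery always fires with this schedule. The driver still checks after the loop and catches follower 1 up if it is still down, so all three nodes converge even if the schedule is changed later.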
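The window arithmetic is small enough to check directly. The helper name faultWindow is hypothetical; the two expressions are the ones given above, using integer division.

```go
package main

import "fmt"

// faultWindow returns the half-open range [start, end) of op indices
// during which follower 1 is down.
func faultWindow(ops uint64) (start, end uint64) {
	return ops / 2, (ops * 3) / 4
}

func main() {
	for _, ops := range []uint64{200, 2000, 7} {
		s, e := faultWindow(ops)
		fmt.Printf("ops=%d down=[%d,%d)\n", ops, s, e)
	}
	// Prints:
	// ops=200 down=[100,150)
	// ops=2000 down=[1000,1500)
	// ops=7 down=[3,5)
}
```

For the frozen fault scenario (ops=2000) the follower misses exactly 500 log entries. Note that the end of the window is always below ops for ops >= 1, including when ops is not a multiple of 4.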

Per-node canonical encoding

magic           : 8 bytes  = "DSEDIST2"
node_id         : u8
term            : u64 LE
commit_index    : u64 LE
log_len         : u32 LE
log[log_len] of:
    term        : u64 LE
    index       : u64 LE
    op_kind     : u8         (0 = Put, 1 = Del)
    key         : i64 LE
    value       : i64 LE     (0 for Del)
kv_len          : u32 LE
kv[kv_len] of (ascending by key):
    key         : i64 LE
    value       : i64 LE

The cluster snapshot is just node0.encode() || node1.encode() || node2.encode(). last_applied is not serialized — after a write loop completes (with terminal catch-up) it always equals commit_index, so it carries no extra information.
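A minimal Go sketch of the per-node encoder, assuming Go 1.19 or later for the Append helpers in encoding/binary. Field and type names are illustrative, and the starting term of 1 in the example is an assumption; the text does not state the initial term.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"sort"
)

type LogEntry struct {
	Term, Index uint64
	Kind        uint8 // 0 = Put, 1 = Del
	Key, Value  int64 // Value is 0 for Del
}

type Node struct {
	ID          uint8
	Term        uint64
	CommitIndex uint64
	Log         []LogEntry
	KV          map[int64]int64
}

// Encode writes the fields in the order of the layout above, all
// integers little-endian, with the kv rows sorted by key.
func (n *Node) Encode() []byte {
	le := binary.LittleEndian
	out := append([]byte(nil), "DSEDIST2"...)
	out = append(out, n.ID)
	out = le.AppendUint64(out, n.Term)
	out = le.AppendUint64(out, n.CommitIndex)
	out = le.AppendUint32(out, uint32(len(n.Log)))
	for _, e := range n.Log {
		out = le.AppendUint64(out, e.Term)
		out = le.AppendUint64(out, e.Index)
		out = append(out, e.Kind)
		out = le.AppendUint64(out, uint64(e.Key))
		out = le.AppendUint64(out, uint64(e.Value))
	}
	keys := make([]int64, 0, len(n.KV))
	for k := range n.KV {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool { return keys[i] < keys[j] })
	out = le.AppendUint32(out, uint32(len(keys)))
	for _, k := range keys {
		out = le.AppendUint64(out, uint64(k))
		out = le.AppendUint64(out, uint64(n.KV[k]))
	}
	return out
}

func main() {
	empty := &Node{ID: 0, Term: 1}
	one := &Node{ID: 0, Term: 1, CommitIndex: 1,
		Log: []LogEntry{{Term: 1, Index: 1, Kind: 0, Key: 7, Value: 9}},
		KV:  map[int64]int64{7: 9}}
	fmt.Println(len(empty.Encode()), len(one.Encode())) // 33 82
}
```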

Sources of cross-language divergence — avoided

Risk	How we eliminate it
Map iteration order	Sort i64 keys ascending in Go (`sort.Slice`); `BTreeMap`/`std::map` already ordered in Rust/C++.
Endianness	All multi-byte ints written little-endian by hand.
RNG branch-skew	Always draw 3 words per op regardless of kind.
32/64-bit `int`	All wire types are u8/u32/u64/i64; sizes are explicit.
Apply order under fault	Apply is gated by a single monotonic commit-index counter, and `catch_up` is called at well-defined points.
`0` value for `Del`	C++/Go fill `v=0`; Rust matches with explicit `value()` returning `0` for `Del`.

What this synthesizes from prior labs

Earlier lab(s)	Used here as
db-01 storage primitives	Manual byte-level LE encoding.
db-02 data structures	Sorted map state machine.
db-03 write-ahead log	The per-node `log` is the WAL.
db-04 hashing	SHA-256 + SplitMix64 PRNG.
db-05/06/07/08 LSM stages	Replaced here by a simpler in-memory state machine, but the apply-log-then-mutate pattern is the same.
db-13 transactions	Atomic apply per committed entry (no partial state).
db-16 distributed fundamentals	Replication, majority quorum, follower catch-up.
db-17 raft	Leader-only writes, log indexing, commit watermark.
db-22 perf & bench	Deterministic workload + canonical snapshot pattern.

How to verify

bash scripts/verify.sh      # runs all 9 tests in all 3 languages
bash scripts/cross_test.sh  # confirms cross-lang + golden equality

They must end with === OK === and === ALL OK ===, respectively.

References — db-23 capstone

Replication and consensus

Diego Ongaro & John Ousterhout. In Search of an Understandable Consensus Algorithm (Extended Version). ATC 2014. The Raft paper — the leader/log/commit-index model used by this lab is a direct simplification of it.
Leslie Lamport. Paxos Made Simple. 2001. A short restatement of the majority-quorum consensus algorithm that Lamport first published in The Part-Time Parliament (1998).
Flavio Junqueira, Benjamin Reed, Marco Serafini. ZAB: High-performance broadcast for primary-backup systems. DSN 2011. Used by ZooKeeper; closest in spirit to the leader-only single-quorum model here.

Theory

Fischer, Lynch, Paterson. Impossibility of Distributed Consensus with One Faulty Process. JACM 1985. Why deterministic consensus needs failure detectors / partial synchrony.
Eric Brewer. Towards Robust Distributed Systems. PODC 2000 keynote (CAP conjecture). Gilbert & Lynch later proved it.
Seth Gilbert & Nancy Lynch. Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. ACM SIGACT News, 2002.

Practitioner material

MIT 6.824 Distributed Systems lectures (esp. lectures 5–8 on Raft).
Martin Kleppmann. Designing Data-Intensive Applications. O'Reilly 2017. Chs. 5, 8, 9 on replication, consistency, and consensus.
Kyle Kingsbury. Jepsen reports (https://jepsen.io). Practical examples of how real systems violate the guarantees their READMEs claim.

Isolation testing

Martin Kleppmann. Hermitage: concrete tests that expose what isolation levels really mean (https://github.com/ept/hermitage).

What this lab does not model

Leader election (we hardcode node 0 as leader forever).
Log truncation / divergent suffixes (we use synchronous in-process replication, so followers never have entries the leader lacks).
Membership changes, log compaction, snapshots, network partitions beyond a single follower being marked down.

Those are the natural follow-on projects after this capstone — see docs/broader-ideas.md.

Analysis — db-23 capstone

Goal restated

Build the smallest possible thing that is honestly a replicated KV database, port it to three languages with byte-identical state, and prove convergence under a deterministic failure scenario.

Design choices

Why synchronous in-process replication?

A real Raft cluster uses goroutines/threads, network RPC, election timeouts, and randomized jitter — all of which are sources of nondeterminism. For a capstone whose entire point is cross-language byte-equality, that would defeat itself.

So instead the "network" is a function call. The Cluster's submit runs synchronously: it appends on the leader, appends on every live follower, and commits if a quorum is reached. This behaves like the replication half of a Raft cluster with a fixed leader running in lockstep, with no message loss or reordering.

Why majority = 2?

3 nodes, so a quorum is 2. The leader counts itself. As long as either follower is up, the cluster commits. When follower 1 is marked down, follower 2 + leader still form a quorum. If both followers were down simultaneously, writes would block — but our fault schedule never does that, so submit never wedges.

Why a single deterministic leader?

Leader election adds randomness (timeouts) and protocol surface (terms, RequestVote). We pin node 0 as the perpetual leader. The lab still shows the replication half of Raft faithfully; election is left as a follow-on (see broader-ideas.md).

Why three RNG draws per op, including for `Del`?

If we drew fewer words on Del branches, the RNG stream would advance differently for runs that happen to produce more Dels, and frozen hashes would depend on the kind distribution. By always consuming exactly three words we ensure the stream depends only on seed and ops, not on what kinds happened.

Why drop `last_applied` from the wire format?

After the final catch_up (which runs unconditionally if follower 1 ended down), every node satisfies last_applied == commit_index. Including it in the encoding would waste bytes and risk a Rust/Go/C++ divergence if one of them computed it slightly differently mid-run. It is a derived quantity, so we omit it.

Failure model

The only fault we inject is a single follower going down for one quarter of the run:

[0, ops/2)              all three nodes replicate
[ops/2, 3·ops/4)        follower 1 down; quorum is {0, 2}
[3·ops/4, ops)          follower 1 up + caught up; all three replicate
end                     final catch_up if follower 1 is still down (defensive guard)

This produces a clean, hashable post-condition: every node has the same log, the same commit_index, and the same kv map.

Why two scenarios?

normal (no fault) shows the happy path and stresses the commit path under a small workload.
fault (with the follower window) stresses replication under partial availability and the catch-up code path. With 2000 ops the fault window covers ops [1000, 1500), so the recovering follower must replay 500 log entries.

Both must produce the same hash in all three languages.

Execution — db-23 capstone

Build & test

# everything
bash scripts/verify.sh

# cross-language identity check
bash scripts/cross_test.sh

verify.sh runs the 9-test suite in each of Rust, Go, and C++ and ends with === OK ===. cross_test.sh builds three dbctl binaries, runs both scenarios in each language, asserts equality across the three languages, and asserts each matches the frozen golden hash, then ends with === ALL OK ===.

Per-language one-liners

# Rust
( cd src/rust && cargo test --release --lib tests )
( cd src/rust && cargo run --release --bin dbctl -- \
    hash workload --seed 42 --ops 200 --keys 16 --scenario normal; echo )

# Go
( cd src/go && go test ./... )
( cd src/go && go run ./cmd/dbctl \
    hash workload --seed 42 --ops 200 --keys 16 --scenario normal; echo )

# C++
( cd src/cpp && cmake -S . -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j )
src/cpp/build/test_db23
src/cpp/build/dbctl hash workload --seed 42 --ops 200 --keys 16 --scenario normal; echo

CLI shape

dbctl hash workload --seed N --ops N --keys N --scenario <normal|fault>

Prints the SHA-256 hex of the cluster snapshot to stdout.
Writes no trailing newline (matches db-22 convention so shell comparisons stay simple).
Exits 2 on bad arguments.

Frozen scenarios

Scenario	Command
normal	`dbctl hash workload --seed 42 --ops 200 --keys 16 --scenario normal`
fault	`dbctl hash workload --seed 7 --ops 2000 --keys 128 --scenario fault`

Expected outputs:

normal: 5976b45b9f40f440e8249da27fe4fe752e005f606efc3596bdb25ca4e4f99296
fault : d67c36725af65242e985a308db5152af2a3e2525fab33d11ed6e826a252ff792

Observation — db-23 capstone

What we measured during development

1. Log + commit_index advance lock-step on happy path

Three submits with no fault:

after submit Put(1,100): log=[1] commit=1 kv={1:100}  (all 3 nodes)
after submit Put(2,200): log=[1,2] commit=2 kv={1:100,2:200}
after submit Del(1):     log=[1,2,3] commit=3 kv={2:200}

Each submit returns synchronously with all three nodes already in the post-state. This is the put_then_del_replicates test.

2. Quorum still progresses with one follower down

Take follower 1 down between submits. Leader + follower 2 still form a quorum:

follower 1 down.
submit Put(2,2): leader.commit=2 follower2.commit=2 follower1.commit=1
submit Put(3,3): leader.commit=3 follower2.commit=3 follower1.commit=1

This is the fault_window_then_catchup_converges test. After catch_up(1):

follower1.log.len = 3, follower1.commit = 3, follower1.kv = {2:2, 3:3}

3. The snapshot is byte-identical across languages

For both frozen scenarios:

[normal] rust=5976b45b9f40f440e8249da27fe4fe752e005f606efc3596bdb25ca4e4f99296
[normal] go  =5976b45b9f40f440e8249da27fe4fe752e005f606efc3596bdb25ca4e4f99296
[normal] cpp =5976b45b9f40f440e8249da27fe4fe752e005f606efc3596bdb25ca4e4f99296
[normal] gold=5976b45b9f40f440e8249da27fe4fe752e005f606efc3596bdb25ca4e4f99296
[fault]  rust=d67c36725af65242e985a308db5152af2a3e2525fab33d11ed6e826a252ff792
[fault]  go  =d67c36725af65242e985a308db5152af2a3e2525fab33d11ed6e826a252ff792
[fault]  cpp =d67c36725af65242e985a308db5152af2a3e2525fab33d11ed6e826a252ff792
[fault]  gold=d67c36725af65242e985a308db5152af2a3e2525fab33d11ed6e826a252ff792

This is what scripts/cross_test.sh prints on success.

4. Snapshot size for a 1-write cluster

8 magic + 1 id + 8 term + 8 commit + 4 log_len
+ 1 entry of (8+8+1+8+8) = 33
+ 4 kv_len + 1 kv of (8+8) = 20
= 82 bytes per node × 3 nodes = 246 bytes total

Verified in snapshot_layout_smoke tests in all three languages.

What we did not observe

Any divergence between languages, ever, on either scenario.
Any nondeterminism within a single language (each scenario run twice in the determinism tests).
Any case where a follower's log moved ahead of the leader — by construction, only the leader appends new entries; followers only ever copy.

Caveat

The cluster is in-process, so we cannot observe real network behavior: there is no message loss, reordering, or partial partition. The lab models replication semantics under controlled failures, not network robustness, which is left to the follow-on projects in docs/broader-ideas.md.

Verification — db-23 capstone

Acceptance criteria

#	Property	Where checked
1	SHA-256 implementation matches NIST vectors.	`sha256_vectors` test, all 3 langs.
2	SplitMix64 matches the known value `splitmix64(0)`.	`splitmix64_known` test, all 3 langs.
3	Happy-path Put/Del fully replicates and applies on every node.	`put_then_del_replicates` test, all 3 langs.
4	After a fault window + `catch_up`, all three nodes converge.	`fault_window_then_catchup_converges` test, all 3 langs.
5	Per-node snapshot layout is exactly 82 bytes for a 1-op cluster.	`snapshot_layout_smoke` test, all 3 langs.
6	The normal scenario is deterministic (two runs hash-equal).	`workload_is_deterministic` test, all 3 langs.
7	The fault scenario is deterministic.	`fault_scenario_is_deterministic` test, all 3 langs.
8	Normal scenario hashes to the frozen golden.	`scenario_normal_frozen` test, all 3 langs + `cross_test.sh`.
9	Fault scenario hashes to the frozen golden.	`scenario_fault_frozen` test, all 3 langs + `cross_test.sh`.

Each language runs its own copy of the 9 tests, so the suite totals 27 test runs, plus 6 hash-equality checks across languages (3 langs × 2 scenarios) in cross_test.sh.

How to run

bash scripts/verify.sh      # ends with === OK ===
bash scripts/cross_test.sh  # ends with === ALL OK ===

Failure-mode triage

Symptom	Likely cause
Rust passes, Go/C++ fails frozen test	Map iteration order — confirm Go sorts keys, confirm `std::map` (not `std::unordered_map`).
All three languages disagree on the same scenario	RNG-stream drift — check that `step_op` draws exactly 3 words regardless of kind.
Determinism test fails within one language	Some hidden non-determinism (HashMap, address ordering). Switch to ordered map.
Snapshot length wrong	Off-by-one in `log_len`/`kv_len` u32, or wrong endian.
Fault test fails only in C++	Probably `unsigned char` vs `char` in MAGIC comparison, or signed arithmetic on `i64`.

Frozen hashes (locked)

HASH_NORMAL = 5976b45b9f40f440e8249da27fe4fe752e005f606efc3596bdb25ca4e4f99296
HASH_FAULT  = d67c36725af65242e985a308db5152af2a3e2525fab33d11ed6e826a252ff792

These constants live in src/rust/src/lib.rs, src/go/db23_test.go, and src/cpp/src/db23.h, and are also hard-coded in scripts/cross_test.sh. Changing any byte of the wire format requires regenerating both hashes in all four places in lock-step.

Broader ideas — what to build next

This capstone is a deliberately minimal replicated KV. Here are the natural follow-on projects, in roughly increasing scope:

1. Leader election

Replace "node 0 is leader forever" with a Raft-style election: randomized timeouts, terms, RequestVote, log-completeness check. Determinism becomes hard the moment timers exist, so frozen-hash testing must be replaced with invariant-style testing (e.g. "every successful read returns a value from the leader's committed log").

2. Real network

Move from synchronous function calls to in-memory channels first, then to TCP RPC, then to UDP with retransmission. At each layer add the corresponding failure injection (drop, reorder, duplicate, delay) and re-verify safety invariants.

3. Log compaction & snapshots

Today catch_up replays the entire leader log. For a long-running cluster this is infeasible; add Raft-style snapshots: leader sends a full kv state plus the index it represents, follower installs that, then resumes from lastIncludedIndex + 1.

4. Membership changes

Add a Reconfigure op that mutates the cluster set. Use the joint-consensus or single-server membership change algorithms.

5. Read consistency levels

Stale read: any follower answers from its local kv.
Read-your-writes: client reads from leader.
Linearizable read: leader confirms it is still leader via a heartbeat to a quorum before answering, or uses Raft's ReadIndex / lease read.

6. Multi-shard / sharded KV

Use a hash of the key to pick a shard; each shard is its own 3-node Raft group. Add a meta-shard that owns the shard map. This is the architecture of TiKV, CockroachDB, Spanner.

7. Transactions across shards

Layer 2PC (with a transaction coordinator log) over the shard groups. Or do Percolator-style snapshot isolation. Or go full Spanner with TrueTime.

8. Jepsen-style testing

Property-based testing with random clients, random faults (partitions, clock skew, node kills), and a linearizability checker (Knossos or Porcupine).

9. Replace the in-process state machine

Plug in the LSM from db-09 or the B-tree from db-15 as the underlying KV store. The replication layer (this lab) shouldn't have to change.

10. Geo-replication

A second tier of replication across regions, with the per-region cluster acting as a single logical replica. Conflict resolution becomes the central question.

Step 01 — Cluster and log

Goal

Define the three core types — Op, LogEntry, Node — and the container Cluster that holds three nodes. No replication yet; the leader appends to its own log only.

Tasks

Define OpKind as Put | Del and Op { kind, k: i64, v: i64 }.
Define LogEntry { term: u64, index: u64, op: Op }.
Define Node { id: u8, term: u64, commit_index: u64, last_applied: u64, log: Vec<LogEntry>, kv: Map<i64,i64> }.
Implement Node::append requiring entry.index == log.len() + 1, with idempotent re-acceptance of an already-present index (used later by catch_up).
Implement Node::apply_committed: while last_applied < commit_index, apply log[last_applied] to kv and increment.
Implement Node::encode with the canonical wire format from CONCEPTS.md.
Implement Cluster::new with three nodes (ids 0, 1, 2) all marked up, and Cluster::encode_snapshot = concat of all three encodings.
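The append and apply tasks above can be sketched in Go as follows. Two points are assumptions of this sketch rather than statements from the text: re-accepting an already-present index returns success without comparing the entry contents, and a gap is reported as an error value.

```go
package main

import (
	"errors"
	"fmt"
)

type OpKind uint8

const (
	Put OpKind = iota
	Del
)

type Op struct {
	Kind OpKind
	K, V int64
}

type LogEntry struct {
	Term, Index uint64
	Op          Op
}

type Node struct {
	ID                             uint8
	Term, CommitIndex, LastApplied uint64
	Log                            []LogEntry
	KV                             map[int64]int64
}

var errGap = errors.New("append: index is not log.len() + 1")

// Append accepts the next index, re-accepts an index the node already
// holds (idempotent), and rejects anything that would leave a gap.
func (n *Node) Append(e LogEntry) error {
	have := uint64(len(n.Log))
	switch {
	case e.Index >= 1 && e.Index <= have:
		return nil
	case e.Index == have+1:
		n.Log = append(n.Log, e)
		return nil
	default:
		return errGap
	}
}

// ApplyCommitted applies entries (LastApplied, CommitIndex] in order.
// Log indices are 1-based, so Log[LastApplied] is entry LastApplied+1.
// The caller must keep CommitIndex <= len(Log).
func (n *Node) ApplyCommitted() {
	for n.LastApplied < n.CommitIndex {
		e := n.Log[n.LastApplied]
		if e.Op.Kind == Put {
			n.KV[e.Op.K] = e.Op.V
		} else {
			delete(n.KV, e.Op.K)
		}
		n.LastApplied++
	}
}

func main() {
	n := &Node{ID: 1, Term: 1, KV: map[int64]int64{}}
	e1 := LogEntry{Term: 1, Index: 1, Op: Op{Kind: Put, K: 1, V: 100}}
	fmt.Println(n.Append(e1), n.Append(e1))          // <nil> <nil>
	fmt.Println(n.Append(LogEntry{Term: 1, Index: 3})) // gap is rejected
	n.CommitIndex = 1
	n.ApplyCommitted()
	fmt.Println(n.KV[1], n.LastApplied) // 100 1
}
```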

Acceptance

snapshot_layout_smoke test passes in all three languages.
An empty cluster's snapshot has length 3 × (8+1+8+8+4+4) = 99 bytes.

Pitfalls

Go map iteration order is undefined — sort keys before encoding.
std::map (ordered) in C++, NOT std::unordered_map.
All multi-byte ints are little-endian.
v for a Del op encodes as 0.

Step 02 — Replication and commit

Goal

Wire Cluster::submit so that one Op propagates from the leader to every live follower, advances commit_index on majority, and applies into the local kv state.

Tasks

In submit(op):
- Compute leader_idx = leader.log.len() + 1.
- Build LogEntry { term: leader.term, index: leader_idx, op }.
- Append on the leader (must succeed).
- For each follower id 1, 2: if up[fid], append on that follower.
Count acks: start at 1 (leader), then +1 for each up follower whose log.len() >= leader_idx.
If acks >= 2 (majority of 3):
- Set leader.commit_index = leader_idx; call leader.apply_committed.
- For each follower whose log.len() >= leader_idx, set its commit_index to leader_idx and call apply_committed.
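A compact Go sketch of submit following the tasks above. It is an illustration, not the lab's code: the types are trimmed to what the commit rule needs, the starting term of 1 is an assumption, and a live follower whose log is behind is simply skipped here (it neither appends nor acks), which is one way to model append rejecting a gap.

```go
package main

import "fmt"

type Op struct {
	Del  bool
	K, V int64
}

type Entry struct {
	Term, Index uint64
	Op          Op
}

type Node struct {
	Term, Commit, Applied uint64
	Log                   []Entry
	KV                    map[int64]int64
}

// apply replays entries (Applied, Commit] into KV in index order.
func (n *Node) apply() {
	for n.Applied < n.Commit {
		e := n.Log[n.Applied]
		if e.Op.Del {
			delete(n.KV, e.Op.K)
		} else {
			n.KV[e.Op.K] = e.Op.V
		}
		n.Applied++
	}
}

type Cluster struct {
	Nodes [3]*Node // node 0 is the fixed leader
	Up    [3]bool
}

func NewCluster() *Cluster {
	c := &Cluster{Up: [3]bool{true, true, true}}
	for i := range c.Nodes {
		c.Nodes[i] = &Node{Term: 1, KV: map[int64]int64{}}
	}
	return c
}

// Submit appends on the leader, replicates to live followers, and
// commits on every reachable node that holds the entry once at least
// 2 of 3 nodes have it. It reports whether the entry committed.
func (c *Cluster) Submit(op Op) bool {
	leader := c.Nodes[0]
	idx := uint64(len(leader.Log)) + 1
	e := Entry{Term: leader.Term, Index: idx, Op: op}
	leader.Log = append(leader.Log, e)

	acks := 1 // the leader's own copy
	for fid := 1; fid <= 2; fid++ {
		f := c.Nodes[fid]
		if !c.Up[fid] {
			continue // a down follower receives nothing and acks nothing
		}
		if uint64(len(f.Log))+1 == idx {
			f.Log = append(f.Log, e)
		}
		if uint64(len(f.Log)) >= idx {
			acks++
		}
	}
	if acks < 2 {
		return false
	}
	for id, n := range c.Nodes {
		if (id == 0 || c.Up[id]) && uint64(len(n.Log)) >= idx {
			n.Commit = idx
			n.apply() // only after Commit has been bumped
		}
	}
	return true
}

func main() {
	c := NewCluster()
	c.Submit(Op{K: 1, V: 100})
	c.Up[1] = false
	c.Submit(Op{K: 2, V: 200})
	for id, n := range c.Nodes {
		fmt.Printf("node %d: log=%d commit=%d\n", id, len(n.Log), n.Commit)
	}
	// node 0: log=2 commit=2
	// node 1: log=1 commit=1
	// node 2: log=2 commit=2
}
```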

Acceptance

put_then_del_replicates test passes in all three languages.
After three submits in a row to a fresh cluster, every node has log.len() == 3 and commit_index == 3.

Pitfalls

Don't advance commit_index on a follower that hasn't received the entry — that's how silent divergence happens.
The leader commits as soon as one live follower holds the entry, even if the other follower has not ack'd, because the leader's own copy counts toward the majority of 2.
apply_committed must be called after commit_index is bumped, not before.

Step 03 — Fault injection and catch-up

Goal

Add the failure-injection schedule, the catch_up operation, and the top-level run_cluster workload driver — completing the lab.

Tasks

Implement Cluster::set_follower_up(fid, up) (assert fid is 1 or 2, never 0).
Implement Cluster::catch_up(fid):
- Snapshot the leader's log and commit_index.
- While the follower's log.len() is less than the leader's, append leader_log[fol.log.len()] to the follower.
- If the follower's commit_index is below the leader's, set it to the leader's and apply_committed.
Implement step_op(rng, keys):
- Draw r1, r2, r3 = rng.next() (always three).
- kind = (r1 >> 62) & 0x3; 0,1,2 → Put, 3 → Del.
- k = i64(r2 % keys), v = i64(r3 % 1000).
Implement run_cluster(seed, ops, keys, scenario):
- down_start = ops/2, down_end = (ops*3)/4, with_fault = (scenario == "fault").
- For i in 0..ops:
  - If with_fault && i == down_start: set follower 1 down.
  - If with_fault && i == down_end: set follower 1 up, then catch_up(1).
  - submit(step_op(rng, keys)).
- After the loop: if with_fault && !up[1], set follower 1 up and catch_up(1). (A defensive guard: with this schedule down_end < ops whenever ops >= 1, so the in-loop recovery has already run.)
Write a dbctl hash workload --seed N --ops N --keys N --scenario <normal|fault> CLI that prints the SHA-256 hex of run_cluster(...).encode_snapshot() with no trailing newline.
Freeze the two scenario hashes as named constants and assert them in two tests per language. Cross-check with scripts/cross_test.sh.
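The catch_up task is the part most worth sketching in isolation. The Go version below takes the leader and follower as arguments instead of a follower id, and flattens the entry type; both are simplifications made for this example only.

```go
package main

import "fmt"

type Entry struct {
	Index uint64
	Del   bool
	K, V  int64
}

type Node struct {
	Commit, Applied uint64
	Log             []Entry
	KV              map[int64]int64
}

// apply replays entries (Applied, Commit] into KV in index order.
func (n *Node) apply() {
	for n.Applied < n.Commit {
		e := n.Log[n.Applied]
		if e.Del {
			delete(n.KV, e.K)
		} else {
			n.KV[e.K] = e.V
		}
		n.Applied++
	}
}

// catchUp copies the entries the follower is missing, then raises its
// commit index to the leader's and applies the newly committed entries.
func catchUp(leader, fol *Node) {
	// Snapshot the leader's state first so the copy loop reads a stable view.
	leaderLog := append([]Entry(nil), leader.Log...)
	leaderCommit := leader.Commit

	for len(fol.Log) < len(leaderLog) {
		fol.Log = append(fol.Log, leaderLog[len(fol.Log)])
	}
	if fol.Commit < leaderCommit {
		fol.Commit = leaderCommit
		fol.apply()
	}
}

func main() {
	log := []Entry{
		{Index: 1, K: 1, V: 10},
		{Index: 2, K: 2, V: 20},
		{Index: 3, Del: true, K: 1},
	}
	leader := &Node{Commit: 3, Applied: 3, Log: log, KV: map[int64]int64{2: 20}}
	// The follower saw only the first entry before going down.
	fol := &Node{Commit: 1, Applied: 1,
		Log: append([]Entry(nil), log[:1]...), KV: map[int64]int64{1: 10}}

	catchUp(leader, fol)
	fmt.Println(len(fol.Log), fol.Commit, fol.KV) // 3 3 map[2:20]
}
```

In Go the snapshot copy is cheap insurance; in Rust the equivalent clone is what lets the follower be borrowed mutably while the leader's log is read.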

Acceptance

verify.sh ends with === OK ===.
cross_test.sh ends with === ALL OK ===.
The two frozen hashes match across Rust, Go, and C++:
- 5976b45b9f40f440e8249da27fe4fe752e005f606efc3596bdb25ca4e4f99296 (normal, seed=42 ops=200 keys=16)
- d67c36725af65242e985a308db5152af2a3e2525fab33d11ed6e826a252ff792 (fault, seed=7 ops=2000 keys=128)

Pitfalls

Drawing fewer RNG words on the Del branch will silently desync hashes — always draw three.
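The post-loop catch-up is a safety net: with the current schedule follower 1 always recovers inside the loop, but if the window is ever changed so that a run ends inside it, follower 1 still needs to converge.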
catch_up must clone the leader's log first; mutating both at once in Rust requires careful borrow handling.
The "ack on up[fid] only" rule is essential: a down follower contributes zero acks regardless of its log length.

Phase 6 — Cloud Gateway & Application Networking

Target role: Distributed Systems Engineer 5 — Cloud Gateway, Netflix Application Networking Group.

Phases 1–5 of this book build the systems that live behind a request: storage engines, B-trees, LSM trees, and the consensus protocols that keep replicated state honest. Phase 6 builds the systems that live in front of a request: the L4/L7 data plane, the API gateway, the WebSocket fleet, the control plane that programs them, and the Kubernetes substrate they run on.

This is the same discipline — reliable, scalable distributed systems — pointed at a different layer of the stack. A gateway is a distributed system: it has a data plane (many stateless-ish proxies) and a control plane (a consensus-backed source of truth that pushes config to the fleet). If you understood why Raft commits only current-term entries (db-17) you already understand why an xDS control plane must be careful about config propagation ordering (gw-08). The skills transfer; the vocabulary changes.

Why this phase exists

You have a strong foundation in storage and consensus. The Netflix Cloud Gateway role asks for a different, adjacent body of knowledge that the first five phases do not cover:

The role wants…	…and Phases 1–5 cover	Phase 6 lab
L4 (TCP/UDP) expertise	(nothing)	gw-01
L7 (HTTP/S, gRPC, WebSockets)	(nothing)	gw-02, gw-05
API Gateway tech (Zuul, Envoy, Gateway API)	(nothing)	gw-03, gw-08, gw-10
Kubernetes internals (Networking, CNI, CRDs, Operators)	(nothing)	gw-09, gw-10
Data-plane + control-plane design	consensus core (db-16…20)	gw-08
Resilience posture	quorums, partitions (db-17…20)	gw-06
Security posture	(nothing)	gw-07
Observability posture	dump/oracle debugging (db-17)	gw-11
Leading large-scale migrations	(nothing)	gw-12

How this phase is structured

The 12 labs follow the same shape as the rest of the book — CONCEPTS.md (the 8-part framework), references.md, docs/, and steps/ with runnable code — with one deliberate difference. Phases 1–5 prove correctness with byte-identical cross-language dumps. Networking systems are not byte-deterministic by nature (timers, kernel scheduling, connection ordering), so Phase 6 proves things the way the industry actually does: runnable mini-implementations you can point a load generator at, plus the metrics that tell you whether they work under stress. You will build, in Go (the lingua franca of the cloud-native data plane) with Java/Netty where the Zuul lineage matters:

a non-blocking L4 TCP proxy with backpressure and connection draining (gw-01),
an HTTP/2 frame parser and a multiplexing demo (gw-02),
a filter-chain API gateway in the Zuul 2 shape (gw-03),
a per-event-loop pooled, subsetted connection manager that reproduces the connection-churn win (gw-04),
a WebSocket push proxy with a push registry and async delivery (gw-05),
power-of-two-choices load balancing and an adaptive concurrency limiter (gw-06),
an xDS control plane that drives Envoy via go-control-plane (gw-08),
a Kubernetes operator (controller-runtime) that reconciles a Gateway CRD into data-plane config (gw-10),
trace-context propagation through a proxy and RED dashboards (gw-11).

The talks, decoded

The job posting links seven talks. Read them as a map of what the team values and what you'll be expected to reason about on day one. Each maps to a lab so you can go deep where it counts.

Talk	What it's really about	Lab to study
Evolution of Edge @ Netflix	Zuul 1 (blocking) → Zuul 2 (Netty, async non-blocking); the edge gateway as a programmable filter pipeline; push vs pull config	gw-03
Curbing Connection Churn in Zuul	Origin connection reuse: per-event-loop pools + subsetting via a low-discrepancy (Van der Corput) ring; ~8× fewer TCP opens, churn from thousands/s → ~60/s	gw-04
Pushy to the Limit	Evolving a WebSocket proxy to hundreds of millions of concurrent connections; push registry on KeyValue; Kafka message processor; 60k→200k→400k conns/node	gw-05
AWS re:Invent 2018 — Scaling Push Messaging	The original Zuul Push / Pushy architecture: persistent connections, push registry lookup, async message fan-out	gw-05
Show Must Go On — Securing Netflix Studios at Scale	mTLS everywhere, identity (Metatron/SPIFFE-style SVIDs), authz at the edge, zero-trust for partner/studio traffic	gw-07
Managing Netflix's Compute with Kubernetes & Dynamic…	Titus → Kubernetes; custom schedulers/controllers; running the gateway fleet on K8s	gw-09, gw-12
Container Runtime Customization (NRI & OCI Hooks)	containerd NRI + OCI hooks to customize networking/storage/sidecars per workload while staying K8s-compatible	gw-09, gw-12
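
The connection-churn row names a low-discrepancy (Van der Corput) ring. As a minimal sketch of the sequence itself, not of Zuul's code (the function name `vanDerCorput` is illustrative), the base-2 variant mirrors the binary digits of the index around the radix point, so each new point lands in one of the largest remaining gaps on [0, 1):

```go
package main

import "fmt"

// vanDerCorput returns the n-th element of the base-2 Van der Corput
// sequence: the binary digits of n mirrored around the radix point.
func vanDerCorput(n uint32) float64 {
	var q float64
	weight := 0.5 // value of the next fractional binary digit
	for n > 0 {
		if n&1 == 1 {
			q += weight
		}
		weight /= 2
		n >>= 1
	}
	return q
}

func main() {
	// Ring positions for the first eight members, in join order.
	for i := uint32(1); i <= 8; i++ {
		fmt.Printf("member %d -> %.4f\n", i, vanDerCorput(i))
	}
}
```

The first positions are 0.5, 0.25, 0.75, 0.125, and so on. Adding a member never moves an existing one, which is the property that keeps subsets stable when membership changes.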

The throughline: this team has spent a decade learning that the expensive failures at the edge are not algorithmic — they are operational. Connection churn, thundering herds on config push, a WebSocket fleet that can't be drained gracefully, a retry storm that turns a brown-out into an outage. The talks are a catalog of hard-won operational lessons. Phase 6 teaches you to reason about them before you cause them.

What "Distributed Systems Engineer 5" actually means at Netflix

Netflix's IC ladder runs Senior (4) → Senior 5 → Staff/Principal. A "5" is expected to:

Own a problem domain, not a ticket. "Make origin connections stop churning" is the assignment; you scope it, design it, align stakeholders, ship it, and operate it. (gw-12)
Use data to find root cause. The JD says this twice. Every lab's docs/observation.md is written in this spirit: what to measure, how to read it, how to tell correlation from cause.
Mentor and set the bar. Design reviews and code reviews are listed as core responsibilities. The docs/analysis.md in each lab is modeled on a design-review document.
Lead migrations. Listed as a "plus" but it's the differentiator. gw-12 is a full playbook.

Netflix's culture deck terms you'll hear in interviews and should be ready to speak to: "context, not control" (you'll be given the why, not the how), "freedom & responsibility", "highly aligned, loosely coupled" (the org-design analog of microservices), and "the keeper test." Map your stories to these.

A note on languages

The JD says Java, Go, or C++. In this domain:

Java/Netty is the Zuul/Pushy lineage. If you interview with the team that owns Zuul, expect Netty event-loop questions (gw-03, gw-05).
Go is the cloud-native control-plane lingua franca: go-control-plane (Envoy xDS), controller-runtime (operators), most CNI plugins, Kubernetes itself. Phase 6's runnable code is mostly Go for this reason.
C++ is Envoy's data plane. You won't write Envoy from scratch, but you should be able to read a filter and reason about its buffering and lifecycle (gw-08).

Pick the one you'll be tested in and make the runnable labs idiomatic in it; skim the other two.

Suggested path through Phase 6

gw-01 (L4)  ─→  gw-02 (L7)  ─→  gw-03 (API gateway)
                                     │
        ┌────────────────────────────┼───────────────────────────┐
        ↓                            ↓                            ↓
   gw-04 (conn mgmt)          gw-06 (resilience)           gw-07 (security)
   gw-05 (websockets)
        │
        └──→ gw-08 (Envoy/xDS) ──→ gw-09 (K8s net) ──→ gw-10 (Gateway API/operator)
                                                              │
                                          gw-11 (observability)
                                                              │
                                          gw-12 (migration capstone)

Do gw-01 → gw-03 in order (each builds the vocabulary for the next). After gw-03 the branches are independent; pick by what your interview loop emphasizes. gw-12 assumes all of them.

See the per-lab interview-readiness sections below.

Interview-readiness index

Every CONCEPTS.md in this phase ends with a §7 Interview Talking Points section written specifically for a senior Cloud Gateway loop. For a fast pass before an onsite, read just those sections, one per lab, plus:

gw-00 HITCHHIKERS-GUIDE.md — read this first: the warm-up primer that builds the data-plane/control-plane mental model, the request-lifecycle map, the distributed-systems throughline, and how to use the runnable code.
gw-00 INTERVIEW.md — the system-design playbook for gateway problems, behavioral-story mapping, the 30-60-90 day plan, and questions to ask them.
gw-00 references.md — every talk, blog, RFC, and paper that this phase is built on, with one-line "why read it."

Every lab also ships a maintainer-level GUIDE.md (the deep, hands-on companion to its CONCEPTS.md) and real, go test -race-green Go in src/go/. Verify the whole phase with verify-all.sh.

The Hitchhiker's Guide to the Cloud Gateway

A warm-up primer for Phase 6. Read this first. It builds the mental model the whole phase hangs on, shows you how to use the runnable code, and gives you the throughline that turns twelve labs into one coherent story you can tell in an interview.

Don't panic. By the end of this phase you will have built — in real, tested, runnable Go — an L4 proxy, an HTTP/2 parser, a Zuul-shaped API gateway, Netflix's connection-churn fix, a Pushy-style WebSocket fleet, the full resilience toolkit, an mTLS gateway, an xDS control plane, a Kubernetes operator, the observability primitives, and a migration rollout engine. Every bash scripts/verify.sh is green.

1. The one idea that unlocks everything: data plane vs control plane

A cloud gateway is not one program; it's two systems with opposite goals, and almost every design question resolves once you name which one you're talking about.

            CONTROL PLANE  (correctness-optimized, OFF the request path)
        ┌───────────────────────────────────────────────────────┐
        │  source of truth → reconcile → versioned config         │
        │            │  push (xDS / operator)                      │
        └────────────┼──────────────────────────────────────────┘
                     ▼
            DATA PLANE  (p99/throughput-optimized, ON the request path)
   client ──▶ [ L4 → TLS → L7 decode → filters → LB+pool → proxy → log ] ──▶ origins

The data plane is the fleet of proxies on the request path. It is optimized for p99 latency, throughput, and availability. It must keep serving even when the control plane is down (it runs on last-known-good config). gw-01, gw-02, gw-03, gw-04, gw-05, gw-06, gw-07 are all data-plane.
The control plane is the source of truth that computes config and pushes it to the fleet. It is optimized for correctness and safe propagation, not latency. Its outage stops config changes, not traffic. gw-08 and gw-10 are control-plane; gw-09 is the Kubernetes substrate both planes run on; gw-11 observes both; gw-12 changes both safely.
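
The "last-known-good" behaviour described above can be sketched in a few lines. This is an illustration under stated assumptions, not the lab's code: `Config`, `Holder`, and `validate` are hypothetical names, and the validation rules are placeholders. The request path reads the active snapshot lock-free; a rejected push leaves the previous snapshot in place.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// Config is a hypothetical snapshot pushed by the control plane.
type Config struct {
	Version string
	Routes  map[string]string // path prefix -> origin cluster
}

// validate rejects snapshots the data plane should never serve from.
// The rules here are placeholders for real consistency checks.
func validate(c *Config) error {
	if c == nil || c.Version == "" {
		return errors.New("config has no version")
	}
	if len(c.Routes) == 0 {
		return errors.New("config has no routes")
	}
	return nil
}

// Holder is what the request path reads; loads never take a lock.
type Holder struct{ cur atomic.Pointer[Config] }

// Apply swaps in a new snapshot only if it validates. On failure the
// last-known-good snapshot stays active and the error acts as a NACK.
func (h *Holder) Apply(c *Config) error {
	if err := validate(c); err != nil {
		return err
	}
	h.cur.Store(c)
	return nil
}

// Current returns the active snapshot, or nil before the first good push.
func (h *Holder) Current() *Config { return h.cur.Load() }

func main() {
	var h Holder
	_ = h.Apply(&Config{Version: "v1", Routes: map[string]string{"/api": "api-origin"}})
	err := h.Apply(&Config{Version: "v2"}) // invalid: no routes
	fmt.Println("push error:", err)
	fmt.Println("still serving:", h.Current().Version)
}
```

If the control plane disappears entirely, `Current` keeps returning the last snapshot that passed validation, which is the property the two definitions above rely on.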

Say "I'll split this into a data plane and a control plane" in the first 60 seconds of any gateway design interview. It is the single highest-signal move you can make, and the rest of this phase is the detail behind it. (See INTERVIEW.md for the full playbook.)

2. The request lifecycle (and which lab owns each hop)

Trace one request through an L7 gateway and you've toured the whole phase:

accept the TCP connection .................... gw-01 (L4: sockets, backpressure, drain)
terminate TLS / verify mTLS identity ......... gw-07 (mTLS, SPIFFE, rotation)
decode the protocol (HTTP/2 frames, gRPC) .... gw-02 (L7 framing, HPACK, flow control)
run inbound filters: authn, route, rate-limit  gw-03 (the Zuul filter chain)
pick an origin: LB over a pooled subset ...... gw-04 (pooling+subsetting) + gw-06 (P2C)
proxy with resilience: timeout/retry/breaker.. gw-06 (retries, circuit breaker, adaptive)
run outbound filters: rewrite, access log .... gw-03 + gw-11 (RED metrics, trace span)
                            ▲
   the proxies above RUN ON Kubernetes ........ gw-09 (CNI, kube-proxy, drain ordering)
   their config is PUSHED by a control plane .. gw-08 (xDS) / gw-10 (operator)
   and CHANGES ship via a rollout ladder ...... gw-12 (shadow → canary → ramp)

If you can narrate that lifecycle and name where each concern lives, you can hold a Cloud Gateway systems-design round.

3. The distributed-systems throughline (why your db-* work transfers)

Phases 1–5 built storage and consensus. Phase 6 looks different but is the same discipline pointed at a new layer. The connections are real, not rhetorical:

Backpressure is one idea at three layers. TCP's receive window (gw-01), HTTP/2's flow-control credits (gw-02), and your proxy's bounded copy buffer (gw-01) are all "slow the producer when the consumer can't keep up." Adaptive concurrency (gw-06) is the same idea as admission control.
Config propagation is a consistency problem. An xDS control plane pushing versioned, ACK'd config to a fleet (gw-08) has the same hazards as committing a replicated log (db-17): ordering (ADS exists for the same reason Raft cares about log order), versioning, and acknowledgment. Last-known-good is the data plane's "the cluster keeps serving during an election."
Reconciliation is the consensus mindset at the app layer. A Kubernetes operator (gw-10) converging actual→desired state, level-triggered and idempotent, is "drive the system to a replicated desired state, safely, no matter the intermediate failures."
Quorum/majority intuition shows up in subset sizing. "How big must a gw-04 subset be to survive losing f origins?" is db-17's majority argument.
Determinism is your test oracle. The db-* labs proved correctness with byte-identical dumps; Phase 6 proves it with content-hash versioning (gw-08, gw-10), deterministic simulations (gw-06, gw-09, gw-12), and seeded RNGs (gw-04, gw-06) so every test is reproducible.
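
Content-hash versioning, mentioned in the last bullet, can be sketched as follows. This is one plausible shape, assuming a hypothetical `Snapshot` type and `versionOf` helper; the labs may derive versions differently. It leans on the fact that `encoding/json` emits map keys in sorted order, so equal content yields equal bytes.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// Snapshot is a hypothetical config payload.
type Snapshot struct {
	Routes   map[string]string `json:"routes"`
	Clusters []string          `json:"clusters"` // slice order is significant
}

// versionOf derives a short version string from the snapshot's content.
func versionOf(s Snapshot) (string, error) {
	b, err := json.Marshal(s) // map keys are written in sorted order
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:8]), nil
}

func main() {
	a := Snapshot{Routes: map[string]string{"/a": "x", "/b": "y"}, Clusters: []string{"api"}}
	b := Snapshot{Routes: map[string]string{"/b": "y", "/a": "x"}, Clusters: []string{"api"}}
	va, _ := versionOf(a)
	vb, _ := versionOf(b)
	fmt.Println("same content, same version:", va == vb)
}
```

Because the version is a pure function of the content, a test can assert that a repeated reconcile produces no new version and therefore no push.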

You are not learning a new field. You are applying the same rigor to the front door instead of the back room.

4. How Phase 6 is built (and why it differs from Phases 1–5)

Phases 1–5 prove correctness with byte-identical cross-language dumps (Rust/Go/C++). Networking systems aren't byte-deterministic — timers, kernel scheduling, connection ordering — so Phase 6 proves things the way the industry actually does: runnable mini-implementations you can point load at, plus the metrics that show they work under stress.

Concretely, every lab gw-NN ships:

gw-NN-name/
├── CONCEPTS.md      # the 8-part framework (the "why")
├── GUIDE.md         # the deep, hands-on companion (read with the code open)
├── references.md    # papers, RFCs, the Netflix talks, source to read
├── docs/
│   ├── analysis.md      # design-review-style trade-offs
│   ├── execution.md     # how to build/run
│   └── verification.md  # what "green" proves (and doesn't)
├── steps/           # staged, code-rich implementation guides
├── scripts/verify.sh    # builds + vets + tests (offline, stdlib-only)
└── src/go/          # REAL, COMPILABLE, TESTED Go — the thing you hack on

The code is stdlib-only Go (the cloud-native data-plane lingua franca), so it builds and tests offline with zero dependencies — exactly like the db-* labs. Where the Zuul/Pushy lineage matters, the GUIDE points at the Java/Netty equivalents. Each verify.sh runs go test -race and (where there's one) a demonstration program that prints the lab's headline result.

How to work a lab: read CONCEPTS.md for the why → open GUIDE.md next to src/go/ and read the code it walks you through → run bash scripts/verify.sh and read the test names (they're the spec) → run the demo CLI → do the exercises at the end of the GUIDE. The exercises are the interview.

5. The headline results you can reproduce

These are the "you won't find this in a book" moments — run them:

Lab	`verify.sh` shows	Why it matters
gw-04	`500000 → 10000` connections (50× fewer); ring changes 11/500 vs hash-mod 259/500 on membership change	reproduces Netflix's connection-churn win + proves why the Van der Corput ring is stable
gw-06	a transient spike becomes a permanent outage under naive retries; a retry budget recovers to baseline	makes metastable failure tangible
gw-08	versioned push + ACK, debounced no-op, and an inconsistent config rejected (last-known-good kept)	the control-plane correctness model
gw-10	self-heal without a CR change; idempotent (1 push); finalizer cleanup	the operator pattern, correct
gw-11	avg-of-p99 = 0.051s (WRONG) vs merged-p99 = 0.100s (RIGHT)	you cannot average percentiles
gw-12	a healthy ramp to 100%, and an SLO breach at 25% auto-rolling-back to 5%	how you ship to a fleet safely
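
The gw-11 row is easy to reproduce by hand. The sketch below uses made-up samples, chosen so the output lines up with the figures in the table; it is not the lab's data set, and `p99` and `repeat` are illustrative helpers. One node is healthy, one has a slow tail, and the average of their p99s hides the tail that the merged samples reveal.

```go
package main

import (
	"fmt"
	"sort"
)

// p99 returns the nearest-rank 99th percentile of samples (0 if empty).
func p99(samples []float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	s := append([]float64(nil), samples...) // sort a copy, not the caller's slice
	sort.Float64s(s)
	rank := (99*len(s) + 99) / 100 // ceil(0.99*n) in integer math
	return s[rank-1]
}

// repeat returns n copies of v.
func repeat(n int, v float64) []float64 {
	out := make([]float64, n)
	for i := range out {
		out[i] = v
	}
	return out
}

func main() {
	nodeA := repeat(100, 0.002)                             // healthy node
	nodeB := append(repeat(96, 0.002), repeat(4, 0.100)...) // node with a slow tail

	avg := (p99(nodeA) + p99(nodeB)) / 2
	merged := p99(append(append([]float64(nil), nodeA...), nodeB...))

	fmt.Printf("avg-of-p99 = %.3fs, merged-p99 = %.3fs\n", avg, merged)
	// prints: avg-of-p99 = 0.051s, merged-p99 = 0.100s
}
```

In production the merge happens on histograms rather than raw samples, but the lesson is the same: aggregate the distribution first, then take the percentile.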

6. The seven talks, as your study map

The JD links seven talks. They are not background reading — they are the exact problems this team works on, and each maps to a lab whose code you can run:

Talk	Lab
Evolution of Edge @ Netflix / Zuul 2 async	gw-03
Curbing Connection Churn in Zuul	gw-04
Pushy to the Limit / Scaling Push Messaging	gw-05
Securing Netflix Studios at Scale	gw-07
Managing Netflix's Compute with Kubernetes	gw-09, gw-12
Container Runtime Customization (NRI & OCI)	gw-09, gw-12
(the service-mesh / on-demand discovery direction)	gw-08

Watch each talk, then read the matching GUIDE with the code open. You'll recognize every system they describe, and you'll have built a miniature of it.

7. Suggested path

gw-01 (L4) ─▶ gw-02 (L7) ─▶ gw-03 (API gateway)        [do these in order]
                                  │
        ┌─────────────────────────┼──────────────────────────┐
        ▼                         ▼                           ▼
   gw-04 (conn mgmt)       gw-06 (resilience)          gw-07 (security)
   gw-05 (websockets)
        │
        └─▶ gw-08 (xDS) ─▶ gw-09 (K8s net) ─▶ gw-10 (operator) ─▶ gw-11 (observability)
                                                                      │
                                                          gw-12 (migration capstone)

Do gw-01 → gw-03 in order — each builds the vocabulary for the next. After gw-03 the branches are independent; pick by what your interview loop emphasizes. gw-12 assumes the rest.

To verify the entire phase at once:

for d in gw-0*/ gw-1*/; do [ -f "${d}scripts/verify.sh" ] || continue; echo "== $d =="; if bash "${d}scripts/verify.sh" >/dev/null; then echo OK; else echo "FAIL: $d"; fi; done

Now go read gw-01's CONCEPTS and GUIDE. The floor of the stack is a socket and two file descriptors — start there.

Phase 6 — Interview Readiness Playbook

A senior Cloud Gateway loop at a place like Netflix typically has five kinds of conversations. This is how to be ready for each, with the labs that back each one up.

A coding screen — usually a data-structure/algorithm problem, but sometimes "parse this protocol" or "implement a rate limiter."
A systems-design round — "design an API gateway / a push-message system / a global load balancer." This is where you win or lose.
A deep-dive / "tell me about a hard thing you built" — they probe one project to the bottom. Have one story that goes 6 layers deep.
An operations / debugging round — "latency p99 just doubled, walk me through it." Data-driven root-cause, as the JD demands.
Behavioral / culture — ownership, conflict, mentorship, Netflix culture-deck alignment.

1. The systems-design playbook for gateway problems

Gateway design questions reward a specific structure. Use this skeleton; it signals you've actually operated one of these.

Step 0 — Separate data plane from control plane out loud

The single highest-signal move. Say it in the first 60 seconds:

"I'll split this into a data plane — the proxies on the request path, optimized for p99 latency and connection efficiency — and a control plane — the source of truth that computes config and pushes it to the fleet, optimized for correctness and safe propagation. They have completely different SLOs and failure modes."

Then draw the box diagram:

            control plane (correctness-optimized)
        ┌──────────────────────────────────────────┐
        │  config store → reconciler → xDS server   │
        └───────────────┬──────────────────────────┘
                        │ push config (LDS/RDS/CDS/EDS)
        ┌───────────────▼──────────────────────────┐
   ───▶ │  data plane: L4/L7 proxies (p99-optimized) │ ───▶ origins
 client └────────────────────────────────────────────┘

Step 1 — Pin the numbers

Always quantify. For a Netflix-scale edge: O(1M+) rps, hundreds of millions of devices, persistent connections for push. Derive: connections per node, memory per connection, config fan-out size, config change rate. "If a proxy holds 200k WebSocket connections at ~10KB state each, that's ~2GB of connection state alone" tells the interviewer you think in budgets. (gw-05 has the real Pushy numbers.)
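
The budget arithmetic is worth being able to do on a whiteboard. A small sketch, assuming 300M devices (the figure used in the canonical push question), 200k connections per node, and 10KB read as 10,000 bytes; `fleetBudget` is a hypothetical helper:

```go
package main

import "fmt"

// fleetBudget derives the node count and the per-node connection-state
// memory from three inputs an interviewer will expect you to pin.
func fleetBudget(devices, connsPerNode, bytesPerConn int64) (nodes, bytesPerNode int64) {
	nodes = (devices + connsPerNode - 1) / connsPerNode // ceiling division
	bytesPerNode = connsPerNode * bytesPerConn
	return nodes, bytesPerNode
}

func main() {
	nodes, perNode := fleetBudget(300_000_000, 200_000, 10_000)
	fmt.Printf("nodes=%d, connection state per node=%.1f GB\n", nodes, float64(perNode)/1e9)
	// prints: nodes=1500, connection state per node=2.0 GB
}
```

The node count excludes headroom for deploys and zone loss; saying so out loud is part of thinking in budgets.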

Step 2 — Walk the request lifecycle

For an L7 gateway, narrate the path and name where each concern lives:

accept (L4)
  → TLS terminate / mTLS verify        (gw-07)
  → decode HTTP/2 frames, demux streams (gw-02)
  → inbound filters: authn, routing, rate limit, header manip (gw-03)
  → pick origin: LB + subsetting from a pooled connection         (gw-04, gw-06)
  → proxy request body with backpressure                          (gw-01)
  → outbound filters: retries, circuit breaking, response rewrite (gw-06)
  → emit access log + trace span + metrics                        (gw-11)

Step 3 — Name the failure modes before they ask

This is the senior signal. Volunteer:

Retry storms / metastable failure. A blip makes everyone retry, the retries are the new load, the system can't recover even after the original cause clears. Mitigation: retry budgets (cap retries to a % of original traffic), circuit breakers, load shedding, adaptive concurrency. (gw-06)
Connection churn. Closing origin connections after each request burns CPU on TLS handshakes and floods the origin's accept queue. Mitigation: keep-alive pools per event loop + subsetting so a large fleet doesn't fan out to every origin. (gw-04)
Thundering herd on config push. A control-plane change that invalidates every connection at once. Mitigation: jittered, incremental rollout; delta xDS; warm the new config before cutover. (gw-08)
Head-of-line blocking. At L4 (one slow connection), at L7 (HTTP/1.1 pipelining), and even in HTTP/2 (TCP-level HOL — the reason for HTTP/3). (gw-02)
Graceful drain of stateful nodes. You can't just SIGKILL a node holding 200k WebSockets. You need connection draining and client reconnect with backoff + jitter. (gw-05, gw-01)
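
The retry-budget mitigation from the first bullet can be sketched as a counter pair. This is a deliberately simplified version: it counts over the whole process lifetime, whereas a production budget would use a sliding window or token bucket. `RetryBudget` and its methods are hypothetical names, not an API from any library.

```go
package main

import (
	"fmt"
	"sync"
)

// RetryBudget permits retries only while total retries stay within a
// fixed percentage of first-attempt requests.
type RetryBudget struct {
	mu       sync.Mutex
	percent  int64 // retries allowed per 100 first attempts
	requests int64
	retries  int64
}

func NewRetryBudget(percent int64) *RetryBudget { return &RetryBudget{percent: percent} }

// OnRequest records one first attempt.
func (b *RetryBudget) OnRequest() {
	b.mu.Lock()
	b.requests++
	b.mu.Unlock()
}

// AllowRetry reports whether one more retry fits in the budget and, if
// so, spends it. Integer math avoids floating-point edge cases.
func (b *RetryBudget) AllowRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if (b.retries+1)*100 > b.requests*b.percent {
		return false
	}
	b.retries++
	return true
}

func main() {
	b := NewRetryBudget(10)
	allowed := 0
	for i := 0; i < 100; i++ { // every request fails and wants one retry
		b.OnRequest()
		if b.AllowRetry() {
			allowed++
		}
	}
	fmt.Printf("retries allowed: %d of 100 failures\n", allowed)
	// prints: retries allowed: 10 of 100 failures
}
```

Even when every request fails, the origin sees at most 1.1× the original load instead of 2× or more, which is what lets it recover once the original cause clears.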

Step 4 — Close with observability and rollout

"I'd ship this behind a flag, mirror production traffic to it (shadow), compare RED metrics and a golden-signal diff, then ramp with a sticky canary." That sentence alone reflects gw-11 and gw-12.

The five canonical questions, pre-solved

Question	The 30-second spine	Lab
"Design an API gateway"	data/control split → filter chain → conn pool + LB → resilience → observability	gw-03, gw-04, gw-06
"Design push notifications for 300M devices"	WebSocket fleet + push registry (device→node) + Kafka fan-out + reconnect strategy	gw-05
"Design a global/edge load balancer"	anycast/DNS → L4 → L7 → P2C + zone-aware + outlier ejection	gw-06
"Roll out a new proxy to the whole fleet safely"	shadow → canary → ramp w/ automated rollback on SLO breach	gw-12
"Service mesh / how do 1000 services find each other"	control plane + xDS + on-demand cluster discovery + mTLS identity	gw-08, gw-07

2. The deep-dive round

Have one project you can take six layers deep. Structure it as: problem → constraints → options-you-rejected → design → the bug you didn't expect → what you measured → what you'd do differently. The "options you rejected" and "the bug you didn't expect" are what separate a 5 from a 4. Each lab's docs/analysis.md is written as a model of this: it always has a "Tradeoffs worth flagging" and a "what production needs beyond this lab" section. Steal that structure for your own story.

3. The operations / debugging round

The JD: "identify root causes using data." Practice this script on a prompt like "p99 latency doubled at 14:03, go":

Scope it. One region or global? One route or all? One origin or all? (Narrow the blast radius before theorizing.)
Golden signals. Rate, Errors, Duration, Saturation. Which moved first? (gw-11)
Correlate, don't guess. Did a deploy, a config push, or an origin event line up with 14:03? Overlay the timelines.
Walk the path. Is the latency in TLS, in the filter chain, in the LB pick (a hot/ejected origin?), in the origin itself, or in connection acquisition (pool exhaustion → new connects → churn)?
Form a falsifiable hypothesis, then find the one metric that confirms or kills it. ("If it's pool exhaustion, pool.pending and connections.created.rate should be up. They are → it's churn.")
Mitigate, then fix. Shed load / shift traffic / roll back now; root-cause and durable fix after.

Memorize a short list of "what each symptom usually means":

Symptom	Usual suspects
p99 up, p50 flat	tail amplification: retries, a slow origin in the LB set, GC pause, pool contention
errors up, latency flat	circuit breaker open, origin 5xx, auth/cert expiry
CPU up, rps flat	TLS handshake churn (connection churn!), inefficient filter, logging
connections.created.rate spikes	pool too small, subset too small, keep-alive disabled, origin closing conns
memory climbs on push nodes	connection leak, slow consumers, no idle eviction

4. Behavioral / Netflix-culture mapping

Prepare 3–4 stories and tag each with the dimensions below. Netflix specifically interviews for these.

Dimension	What they're listening for	Have a story about…
Ownership	you scoped + shipped + operated, end to end	a project you drove from ambiguity to production
Judgment under ambiguity	"context not control" — you made the call with incomplete info	a reversible decision you made fast, an irreversible one you made carefully
Impact via data	you measured, you were sometimes wrong, you let data win	a hypothesis the data killed
Selflessness / "highly aligned, loosely coupled"	you aligned stakeholders without controlling them	a cross-team migration or a contentious design review
Mentorship	you raised the bar for others	a design/code review where you changed an outcome by teaching
Candor	you gave/received hard feedback well	the keeper-test-adjacent conversation

5. The 30-60-90 day plan (say this if asked "what would your first quarter look like?")

Days 0–30 — Learn the system, earn trust. Read the data-plane and control-plane code paths end to end. Trace one real request through every hop. Ship something small and safe (a metric, a filter, a runbook fix) to learn the deploy pipeline. Pair on an on-call shadow. Map the stakeholders.
Days 30–60 — Own a slice. Take a real problem in your domain (e.g., a connection-efficiency or observability gap). Write the design doc. Run it through a design review as the author. Start the build behind a flag.
Days 60–90 — Ship and measure. Shadow → canary → ramp the change. Show a before/after metric. Write the postmortem-style retro even if nothing broke. Begin mentoring a junior on the next slice.

The thread: reduce risk early, deliver measured impact by the end of the quarter, leave the system better-documented than you found it.

6. Questions to ask them (these signal seniority)

"Where does the data plane / control plane boundary sit today, and what's the config-propagation SLO — how fast and how safely does a change reach the whole fleet?"
"What's your current strategy for origin connection efficiency — are you on per-event-loop pools and subsetting fleet-wide yet?"
"For the WebSocket/push tier, how do you drain a node holding hundreds of thousands of connections during a deploy?"
"How far along is the move onto the Kubernetes Gateway API, and where does Envoy vs the Zuul lineage fit in the target architecture?"
"What does a large migration look like here operationally — shadow traffic, sticky canaries, automated rollback on SLO breach?"
"What's the most surprising production failure mode the team has hit in the last year?" (Their answer tells you what you'd actually work on.)

7. The one-page cheat sheet

Print this. Read it in the lobby.

DATA PLANE vs CONTROL PLANE — say it first, always.
NUMBERS — rps, conns/node, config size, change rate. Always quantify.
REQUEST LIFECYCLE — accept→TLS→decode→filters→LB+pool→proxy→filters→log.
FAILURE MODES — retry storms (budgets!), churn (pools+subsetting),
                config thundering-herd (delta+jitter), HOL (h3),
                draining stateful nodes (reconnect+backoff+jitter).
RESILIENCE — P2C LB, zone-aware, outlier ejection, circuit breaker,
             adaptive concurrency (Little's law), load shedding.
SECURITY — mTLS everywhere, SPIFFE/SVID identity, authz at edge.
ROLLOUT — flag → shadow/mirror → sticky canary → ramp → auto-rollback.
DEBUG — scope → golden signals → correlate → walk path → falsifiable
        hypothesis → mitigate then fix.
CULTURE — ownership, context-not-control, highly-aligned/loosely-coupled,
          impact via data, candor.

Phase 6 — References

The whole reading list for the Cloud Gateway role, grouped by theme. Per-lab references.md files go deeper; this is the master index.

The Netflix talks & posts named in the JD (read these first)

Curbing Connection Churn in Zuul — Netflix TechBlog, 2023. The per-event-loop connection pool + subsetting work; the headline result (≈8× fewer TCP opens, churn from thousands/s to ~60/s). Directly named in the JD. https://netflixtechblog.com/curbing-connection-churn-in-zuul-2feb273a3598 — Arthur Gonigberg's companion write-up: https://arthur.gonigberg.com/2023/10/03/curbing-connection-churn/
Pushy to the Limit: Evolving Netflix's WebSocket proxy for the future — Netflix TechBlog, 2024. Hundreds of millions of concurrent WebSocket connections, push registry on KeyValue (was Dynomite), Kafka message processor, 60k→200k→400k conns/node. https://netflixtechblog.com/pushy-to-the-limit-evolving-netflixs-websocket-proxy-for-the-future-b468bc0ff658
Scaling Push Messaging for Millions of Devices — InfoQ / re:Invent 2018. The original Zuul Push / Pushy architecture. https://www.infoq.com/news/2018/07/zuul-push-messaging/
Zuul 2: The Netflix Journey to Asynchronous, Non-Blocking Systems — Netflix TechBlog. The Netty rewrite; event loop vs thread-per-request; ~25% throughput gain; 80+ Zuul clusters, >1M rps. https://netflixtechblog.com/zuul-2-the-netflix-journey-to-asynchronous-non-blocking-systems-45947377fb5c
Zero Configuration Service Mesh with On-Demand Cluster Discovery — Netflix TechBlog. Envoy + a control plane that discovers clusters lazily; the service-mesh direction of the Studios-security talk. https://netflixtechblog.com/zero-configuration-service-mesh-with-on-demand-cluster-discovery-ac6483b52a51
Titus, the Netflix container management platform — TechBlog + ACM Queue "Titus: Introducing Containers to the Netflix Cloud." The Mesos→Kubernetes story behind "Managing Netflix's Compute with K8s." https://netflixtechblog.com/titus-the-netflix-container-management-platform-is-now-open-source-f868c9fb5436 · https://queue.acm.org/detail.cfm?id=3158370
Container Runtime Customization at Netflix (NRI & OCI Hooks) — KubeCon NA. containerd NRI + OCI hooks for per-workload runtime customization while staying K8s-compatible. https://github.com/containerd/nri

Source code worth reading

Netflix/zuul (Java/Netty) — the reference open-source API gateway. Read BaseZuulFilter, the filter types (inbound/endpoint/outbound), and the Netty ChannelHandler pipeline. https://github.com/Netflix/zuul
envoyproxy/envoy (C++) — the canonical L4/L7 proxy. Read the listener → filter chain → cluster model; one HTTP filter end to end. https://github.com/envoyproxy/envoy
envoyproxy/go-control-plane (Go) — build an xDS server (gw-08). https://github.com/envoyproxy/go-control-plane
kubernetes-sigs/gateway-api + controller-runtime (Go) — the Gateway API CRDs and the operator framework (gw-10). https://github.com/kubernetes-sigs/gateway-api · https://github.com/kubernetes-sigs/controller-runtime
containernetworking/cni + cilium/cilium — the CNI spec and an eBPF dataplane (gw-09). https://github.com/containernetworking/cni
spiffe/spire — workload identity / SVID issuance (gw-07). https://github.com/spiffe/spire

Protocol specs (the L4/L7 canon)

TCP — RFC 9293 (consolidated). UDP — RFC 768.
HTTP/1.1 — RFC 9110 (semantics) + RFC 9112 (syntax).
HTTP/2 — RFC 9113. HPACK — RFC 7541.
HTTP/3 — RFC 9114. QUIC — RFC 9000. QPACK — RFC 9204.
WebSocket — RFC 6455.
TLS 1.3 — RFC 8446. PROXY protocol — HAProxy spec v1/v2.
gRPC — the grpc-over-http2 wire spec in grpc/grpc.
W3C Trace Context — https://www.w3.org/TR/trace-context/

Books & long-form

Site Reliability Engineering (Google) — load shedding, retry budgets, "Addressing Cascading Failures," handling overload.
Marc Brooker's blog — timeouts, retries, jitter, metastable failures. https://brooker.co.za/blog/
Systems Performance (Brendan Gregg) — USE method, the kernel network stack, eBPF.
Kubernetes Networking docs + Networking and Kubernetes (O'Reilly).
Envoy docs — the best free explanation of a modern data plane: https://www.envoyproxy.io/docs

Cross-phase links

db-16…db-20 (consensus) — the control plane's source of truth is consensus-backed; xDS config propagation is a distributed-consistency problem in disguise. gw-08 references db-17 directly.
db-01 (storage primitives) — epoll/kqueue, zero-copy (splice/sendfile), and page-cache behavior from gw-01 build on the syscall-level intuition from db-01.

gw-01 — The L4 Data Plane: TCP/UDP, Sockets, and a TCP Proxy

This lab is the floor of the gateway stack. Before HTTP, before TLS, before any "gateway" exists, there is a kernel socket, an accept queue, two file descriptors, and the job of shoveling bytes from one to the other without copying them more than you must, without letting a fast sender overrun a slow receiver, and without leaking a single connection when you redeploy. Every higher lab — the HTTP/2 demux in gw-02, the filter chain in gw-03, the connection pool in gw-04, the WebSocket fleet in gw-05 — sits on top of the primitives here.

The JD asks for "deep expertise in L4 (TCP/UDP)." That phrase means exactly the contents of this lab: you can explain the TCP state machine, the difference between an accept queue and a SYN queue, what TCP_NODELAY actually disables, why SO_REUSEPORT matters for a multi-core proxy, how zero-copy splice works, and how you drain a node without dropping in-flight requests.

You will build a non-blocking L4 TCP proxy in Go: an event-loop acceptor, bidirectional copy with backpressure, connection draining on shutdown, and the PROXY protocol so the origin learns the real client IP. You will load-test it and read the kernel counters that tell you whether it's healthy.

1. What is it?

A Layer-4 (transport) data plane moves bytes between a client connection and an origin connection without understanding the application protocol riding on top. It operates on the transport header — TCP ports and sequence numbers, or UDP datagrams — not on URLs or HTTP methods. An L4 load balancer/proxy picks a backend per connection (or per UDP flow) and then forwards the opaque byte stream.

Contrast with L7 (gw-02/gw-03), which terminates the application protocol, can route per request, can retry, and can rewrite. L4 is dumber, faster, and protocol-agnostic; L7 is smarter, slower, and protocol-specific. Real edges use both: an L4 layer for raw throughput and DDoS absorption, an L7 layer behind it for routing and policy.

The OSI/TCP layering you must be fluent in:

 L7  application   HTTP, gRPC, WebSocket, DNS        ← gw-02, gw-03, gw-05
 L4  transport     TCP (streams), UDP (datagrams)    ← THIS LAB
 L3  network       IP, routing, anycast              ← gw-09 (CNI)
 L2  link          Ethernet, ARP, veth pairs         ← gw-09

TCP in one diagram

TCP is a reliable, ordered, byte-stream protocol built on top of unreliable IP packets. The three-way handshake and the connection state machine are interview table stakes:

client                         server
  │   SYN (seq=x)                 │      CLOSED → LISTEN (server bind+listen)
  │ ────────────────────────────▶│      ── arrives in SYN queue ──
  │   SYN-ACK (seq=y, ack=x+1)    │
  │ ◀────────────────────────────│
  │   ACK (ack=y+1)               │      ── moves to ACCEPT queue ──
  │ ────────────────────────────▶│      accept() returns a new fd
  │           ESTABLISHED         │
  │ ◀══════ data both ways ══════▶│
  │   FIN ─▶  ... ◀─ ACK/FIN      │      active close → TIME_WAIT (2·MSL)

Two kernel queues sit behind one listen():

SYN queue (incomplete): half-open connections awaiting the final ACK. Sized by net.ipv4.tcp_max_syn_backlog. SYN floods fill this; SYN cookies are the defense.
Accept queue (completed): fully established connections waiting for your process to accept(). Sized by min(backlog, net.core.somaxconn). If your accept loop is too slow, this overflows and the kernel silently drops or resets connections — one of the most common "mysterious latency" causes at a busy gateway.

UDP — why a gateway still cares

UDP is connectionless datagrams: no handshake, no ordering, no retransmit. It matters at the edge because QUIC/HTTP3 rides on UDP (gw-02), DNS is primarily UDP, and L4 LBs must hash each UDP flow's 4-tuple (src ip, src port, dst ip, dst port) to a stable backend without per-packet state. The hard part of UDP load balancing is the absence of a connection: you maintain a flow table with timeouts instead of relying on FIN.
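The flow-table idea can be sketched in a few lines of Go. This is a minimal illustration, not the lab's code: the names `flowTable`, `pick`, and `expire`, the FNV hash, the 30 s idle timeout, and the addresses are all assumptions chosen for the example, and the table is not safe for concurrent use.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"net/netip"
	"time"
)

// flowKey is the UDP 4-tuple that identifies one flow.
type flowKey struct {
	src, dst netip.AddrPort
}

type flowEntry struct {
	backend  string
	lastSeen time.Time
}

// flowTable pins each flow to a backend and forgets flows that go idle,
// because UDP has no FIN to mark the end of a flow.
type flowTable struct {
	backends []string
	idle     time.Duration
	flows    map[flowKey]flowEntry
}

func newFlowTable(backends []string, idle time.Duration) *flowTable {
	return &flowTable{backends: backends, idle: idle, flows: make(map[flowKey]flowEntry)}
}

// pick returns the backend for a datagram of flow k seen at time now.
func (t *flowTable) pick(k flowKey, now time.Time) string {
	if e, ok := t.flows[k]; ok && now.Sub(e.lastSeen) < t.idle {
		e.lastSeen = now // refresh the idle timer
		t.flows[k] = e
		return e.backend
	}
	// New (or expired) flow: hash the 4-tuple to choose a backend.
	h := fnv.New32a()
	h.Write([]byte(k.src.String()))
	h.Write([]byte(k.dst.String()))
	b := t.backends[h.Sum32()%uint32(len(t.backends))]
	t.flows[k] = flowEntry{backend: b, lastSeen: now}
	return b
}

// expire drops flows idle for at least t.idle and reports how many it removed.
func (t *flowTable) expire(now time.Time) int {
	n := 0
	for k, e := range t.flows {
		if now.Sub(e.lastSeen) >= t.idle {
			delete(t.flows, k)
			n++
		}
	}
	return n
}

// demoFlowTable uses made-up addresses and a hypothetical 30 s idle timeout.
func demoFlowTable() (sticky bool, expired int) {
	t := newFlowTable([]string{"10.0.0.1:53", "10.0.0.2:53", "10.0.0.3:53"}, 30*time.Second)
	k := flowKey{
		src: netip.MustParseAddrPort("203.0.113.7:40000"),
		dst: netip.MustParseAddrPort("198.51.100.1:53"),
	}
	start := time.Unix(0, 0)
	first := t.pick(k, start)
	second := t.pick(k, start.Add(10*time.Second))
	return first == second, t.expire(start.Add(60 * time.Second))
}

func main() {
	sticky, expired := demoFlowTable()
	fmt.Println("sticky:", sticky, "expired:", expired)
}
```

The timeout is the policy knob: too short and a quiet flow gets re-hashed mid-conversation, too long and the table grows with dead flows.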

2. Why does it matter?

It's where p99 latency and CPU are won or lost. A gateway is an I/O machine. The difference between a thread-per-connection design (one OS thread blocked per socket; collapses past ~10k connections — the C10K problem) and an event-loop design (one thread multiplexing thousands of sockets via epoll/kqueue) is the difference between Zuul 1 and Zuul 2. The team's "Evolution of Edge" talk is, at its core, this transition.
Connection efficiency is a top-line metric here. The connection-churn work (gw-04) is entirely about L4 behavior: keep-alive, pooling, and not paying for a TCP+TLS handshake on every request. You cannot reason about gw-04 without the socket-level model in this lab.
Backpressure is a correctness property, not a nice-to-have. If you read from a fast client faster than a slow origin can accept, you buffer without bound and OOM the proxy. A correct L4 proxy couples the two directions: stop reading one side when the other side's write buffer is full. This is the single most common bug in homegrown proxies.
Graceful drain is the operational core of the role. Every deploy, every scale-down, every node replacement requires draining in-flight connections without dropping requests. For a stateless L7 node that's seconds; for a stateful WebSocket node (gw-05) it's a careful dance. It starts here.

3. How does it work?

The event-loop acceptor (the C10K answer)

listen_fd = socket(); bind(); listen(backlog)
register listen_fd with epoll/kqueue for READ

loop forever:
    events = epoll_wait()            # blocks until a fd is ready
    for ev in events:
        if ev.fd == listen_fd:
            conn_fd = accept4(listen_fd, NONBLOCK)   # drain the accept queue
            origin_fd = dial(pick_backend())
            register both fds for READ
        else:
            pump(ev.fd)              # non-blocking read → write to peer

One thread, thousands of connections. Go hides the epoll loop behind goroutines + the netpoller, so the idiomatic Go proxy is "two goroutines per connection" but the runtime is doing exactly the event-loop above. Java/Netty exposes it directly as the EventLoopGroup. Know both framings.

Bidirectional copy with backpressure

The heart of an L4 proxy is two coupled copies:

client ──read──▶ [proxy] ──write──▶ origin     (upstream direction)
client ◀─write── [proxy] ◀──read── origin       (downstream direction)

The critical rule: a write that blocks must stop the corresponding read. In epoll terms, when write() returns EAGAIN you deregister READ on the source until the destination signals WRITE-ready. In Go, a blocking Write on the destination naturally throttles the Read on the source because they're in the same goroutine — io.Copy gives you backpressure for free, which is why the lab uses it and then shows you what it's hiding.

Zero-copy: don't touch the bytes

A naive proxy copies each byte twice across the userspace boundary: read() (kernel→user) then write() (user→kernel). For an L4 proxy that never inspects the payload, that's pure waste. Linux splice(2) moves bytes between two fds through a kernel pipe without copying to userspace; sendfile(2) does file→socket. This is how high-end L4 proxies hit line rate. (You can't splice once you need to inspect or TLS-terminate; that's the L4/L7 cost tradeoff in one syscall.)

naive:    socket ─read→ user buffer ─write→ socket   (2 copies, 2 syscalls/buf)
splice:   socket ════════ kernel pipe ════════ socket (0 userspace copies)

The socket options that matter

Option	What it does	When you set it on a gateway
`TCP_NODELAY`	disables Nagle's algorithm (which coalesces small writes)	almost always ON for a proxy — Nagle + delayed-ACK causes 40ms stalls on small request/response
`SO_REUSEPORT`	multiple sockets bind the same port; kernel load-balances accepts across them	run N acceptor threads/processes, one per core, no accept-lock contention
`SO_REUSEADDR`	rebind a port in `TIME_WAIT`	restart without "address already in use"
`SO_KEEPALIVE` + `TCP_KEEPIDLE/INTVL/CNT`	detect dead peers on idle connections	essential for long-lived/pooled/WebSocket connections (gw-04, gw-05)
`TCP_USER_TIMEOUT`	how long unacked data may stay before the conn is dropped	bound how long a half-dead origin can hold a request
`SO_LINGER`	behavior of `close()` w.r.t. unsent data / RST	drain logic; usually leave default, understand it
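
Several of these options are reachable from Go without raw syscalls. The sketch below assumes a hypothetical helper named `tuneConn` and an illustrative 30 s keep-alive period; it uses only methods that exist on `*net.TCPConn`. How `SetKeepAlivePeriod` maps onto `TCP_KEEPIDLE`/`TCP_KEEPINTVL` is platform-dependent, and options such as `SO_REUSEPORT` or `TCP_USER_TIMEOUT` need a `Control` hook instead.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// tuneConn applies the per-connection options that Go exposes directly
// on *net.TCPConn. The 30 s period is an illustrative value, not a recommendation.
func tuneConn(c net.Conn) error {
	tc, ok := c.(*net.TCPConn)
	if !ok {
		return fmt.Errorf("not a TCP connection: %T", c)
	}
	if err := tc.SetNoDelay(true); err != nil { // TCP_NODELAY: disable Nagle
		return err
	}
	if err := tc.SetKeepAlive(true); err != nil { // SO_KEEPALIVE: probe idle peers
		return err
	}
	return tc.SetKeepAlivePeriod(30 * time.Second) // keep-alive timing
}

// tuneLoopback dials a throwaway loopback listener and tunes that connection.
func tuneLoopback() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	defer ln.Close()
	c, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return err
	}
	defer c.Close()
	return tuneConn(c)
}

func main() {
	fmt.Println("tune error:", tuneLoopback())
}
```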

PROXY protocol — preserving the client IP

When you put a proxy in front of an origin, the origin's accept() sees the proxy's IP, not the client's. For L7 you'd add X-Forwarded-For; for L4 (no HTTP to add a header to) the standard is the PROXY protocol: a small header prepended to the byte stream before the real payload, carrying the original (src ip:port, dst ip:port). v1 is text (PROXY TCP4 1.2.3.4 5.6.7.8 56324 443\r\n); v2 is binary. Your origin (or the next proxy) parses and strips it.
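
A v1 header is simple enough to sketch end to end. The function names `formatProxyV1`, `parseProxyV1`, and `parseEndpoint` are hypothetical (this is not the lab's codec), the addresses are documentation examples, and the parser deliberately skips things a production one needs: the `PROXY UNKNOWN` form, a cap on header length, and a trust check on the peer.

```go
package main

import (
	"bufio"
	"fmt"
	"net/netip"
	"strconv"
	"strings"
)

// formatProxyV1 renders the v1 text header for a TCP4 or TCP6 connection.
func formatProxyV1(src, dst netip.AddrPort) string {
	fam := "TCP4"
	if src.Addr().Is6() {
		fam = "TCP6"
	}
	return fmt.Sprintf("PROXY %s %s %s %d %d\r\n", fam, src.Addr(), dst.Addr(), src.Port(), dst.Port())
}

// parseEndpoint combines an IP string and a port string into an AddrPort.
func parseEndpoint(ip, port string) (netip.AddrPort, error) {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return netip.AddrPort{}, err
	}
	p, err := strconv.ParseUint(port, 10, 16)
	if err != nil {
		return netip.AddrPort{}, err
	}
	return netip.AddrPortFrom(addr, uint16(p)), nil
}

// parseProxyV1 consumes one header line from r and leaves r positioned
// at the first byte of the real payload.
func parseProxyV1(r *bufio.Reader) (src, dst netip.AddrPort, err error) {
	line, err := r.ReadString('\n')
	if err != nil {
		return src, dst, err
	}
	f := strings.Fields(line)
	if !strings.HasSuffix(line, "\r\n") || len(f) != 6 || f[0] != "PROXY" ||
		(f[1] != "TCP4" && f[1] != "TCP6") {
		return src, dst, fmt.Errorf("malformed PROXY v1 header: %q", line)
	}
	if src, err = parseEndpoint(f[2], f[4]); err != nil {
		return src, dst, err
	}
	dst, err = parseEndpoint(f[3], f[5])
	return src, dst, err
}

// roundTripProxyV1 writes a header plus payload and parses it back.
func roundTripProxyV1() (header, src, dst, rest string) {
	s := netip.MustParseAddrPort("203.0.113.7:56324")
	d := netip.MustParseAddrPort("10.0.0.5:8080")
	header = formatProxyV1(s, d)
	r := bufio.NewReader(strings.NewReader(header + "GET / HTTP/1.1\r\n"))
	gotSrc, gotDst, err := parseProxyV1(r)
	if err != nil {
		return header, "", "", err.Error()
	}
	rest, _ = r.ReadString('\n') // the payload is untouched by the parser
	return header, gotSrc.String(), gotDst.String(), rest
}

func main() {
	header, src, dst, rest := roundTripProxyV1()
	fmt.Printf("%q\n%s -> %s\n%q\n", header, src, dst, rest)
}
```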

Connection draining on shutdown

on SIGTERM:
    flip readiness probe to "not ready"     # LB stops sending new conns
    stop accepting new connections (close listen_fd) once the LB has noticed
    wait for in-flight connections to close, up to drain_deadline
    after deadline: force-close the stragglers, log them
    exit

The readiness-probe flip is what makes this work in Kubernetes: the endpoint is removed from the Service before the pod dies (gw-09). Get the ordering wrong — exit before the LB notices — and you drop requests on every deploy.
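
The wait-with-deadline step is the part that is easy to get subtly wrong, so here is a minimal Go sketch of it. The names `waitWithDeadline` and `drainDemo` are hypothetical and the millisecond values are arbitrary; the readiness flip and the force-close of stragglers are left to the caller.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// waitWithDeadline waits for every in-flight handler tracked by wg,
// but gives up after timeout. It reports true if all handlers finished in time.
func waitWithDeadline(wg *sync.WaitGroup, timeout time.Duration) bool {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case <-done:
		return true
	case <-timer.C:
		return false // caller should force-close the stragglers and log them
	}
}

// drainDemo simulates one handler that runs for workMillis and a drain
// deadline of deadlineMillis.
func drainDemo(workMillis, deadlineMillis int) bool {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		time.Sleep(time.Duration(workMillis) * time.Millisecond)
	}()
	return waitWithDeadline(&wg, time.Duration(deadlineMillis)*time.Millisecond)
}

func main() {
	fmt.Println("fast handler drained cleanly:", drainDemo(10, 1000))
	fmt.Println("slow handler drained cleanly:", drainDemo(500, 20))
}
```

A `false` result is the signal to close the remaining connections yourself; without that step the process would hang on a single stuck peer.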

4. Core terminology

Term	Definition
L4 / transport	Operates on TCP/UDP headers (ports, flows); payload is opaque.
SYN queue	Kernel queue of half-open connections awaiting the final handshake ACK.
Accept queue	Kernel queue of established connections awaiting `accept()`. Overflow → drops.
C10K	The historical problem of serving 10k+ concurrent connections; solved by event loops, not threads.
`epoll` / `kqueue`	Linux / BSD-macOS readiness-notification APIs; the engine under every event-loop proxy.
Backpressure	Slowing a producer because the consumer can't keep up; for a proxy, stop reading one side when the other can't be written.
Nagle's algorithm	Coalesces small TCP writes to reduce packet count; harmful for latency-sensitive small messages → disable via `TCP_NODELAY`.
Head-of-line blocking (L4)	A slow/lost segment stalls everything behind it on the same connection (TCP guarantees order).
`splice`/`sendfile`	Zero-copy data movement between fds through kernel buffers.
PROXY protocol	A header prepended to an L4 stream to convey the original client/destination addresses.
TIME_WAIT	Post-close state (2·MSL) on the active closer; too many → ephemeral-port/conntrack exhaustion.
Conntrack	The kernel connection-tracking table (NAT/firewall); a finite resource a busy gateway can exhaust.
Draining	Letting in-flight connections finish while refusing new ones before shutdown.

5. Mental models

A proxy is a bucket brigade, not a warehouse. Bytes should flow through, not pile up. If your memory grows with throughput, you've lost backpressure and turned a brigade into a warehouse that will eventually catch fire (OOM).
The accept queue is a checkout line. SYN queue = people walking toward the register; accept queue = people standing in line with full carts; accept() = the cashier. A slow cashier (slow accept loop) doesn't make the line longer forever — past somaxconn the store locks the doors (drops connections) and customers leave (RST/retransmit).
TIME_WAIT is the receipt you must keep. The active closer holds TIME_WAIT for 2·MSL so a delayed duplicate segment from the old connection can't be misread by a new connection reusing the same 4-tuple. It's correct; it just means the side that closes pays the cost — a reason gateways prefer the client to close, or use keep-alive to avoid closing at all (gw-04).
Event loop vs threads is "one chef, many pots" vs "one chef per pot." With thousands of pots (connections), hiring a chef per pot (thread-per-connection) bankrupts you on context switches and stack memory. One chef watching all the pots and stirring whichever is ready (epoll) scales. This is the entire Zuul 1 → Zuul 2 thesis.

6. Common misconceptions

"A bigger listen() backlog fixes accept-queue drops." Only if your accept loop can drain it. A huge backlog with a slow loop just delays the drop and adds latency. Fix the loop (or add acceptors via SO_REUSEPORT), then size the queue.
"TCP_NODELAY makes things faster." It reduces latency for small messages by disabling write coalescing — at the cost of more packets. For a request/response gateway it's almost always right; for bulk transfer it can hurt. It's a latency/throughput knob, not a "go faster" button.
"Load balancing at L4 and L7 is the same, just different layers." No: L4 balances connections/flows (sticky for the connection's life; one slow request blocks the connection); L7 balances requests (can spread requests from one connection across backends, retry, and hedge). The mismatch is exactly why HTTP/2 multiplexing complicates L4 LBs (gw-02).
"TLS termination is free if I have spare CPU." TLS handshakes are the expensive part (asymmetric crypto), and they happen on every new connection. This is why connection churn (gw-04) shows up as a CPU problem, and why keep-alive/pooling is a CPU optimization, not just a latency one.
"Draining is just sleep(30) before exit." Sleeping doesn't stop the LB from sending new connections, and it doesn't bound stragglers. Correct drain is: fail readiness → stop accepting → wait-with-deadline → force-close + log.

7. Interview talking points

"Walk me through what happens from listen() to accept()." Hit SYN queue vs accept queue, somaxconn, SYN cookies, and accept-queue overflow as a real latency cause. Mention ss -lnt shows Recv-Q (current accept queue depth) and Send-Q (its max) on a listening socket — naming the diagnostic command signals you've operated this.
"Why did Zuul move from thread-per-request to Netty/event-loop?" C10K: threads cost stack memory (~1MB each) and context switches; past tens of thousands of connections the scheduler dominates. An event loop multiplexes thousands of connections per thread via epoll. Cost: you must never block the event-loop thread (no synchronous I/O, no blocking locks) — the discipline that makes async code hard. ~25% throughput/CPU win at Netflix.
"How do you keep a fast client from OOMing your proxy?" Backpressure: couple the two copy directions so a stalled write pauses the corresponding read. Explain it in epoll terms (drop READ interest on EAGAIN) and Go terms (io.Copy blocks the read goroutine). Bounded buffers, not unbounded queues.
"What's TCP_NODELAY and when would you NOT set it?" Disables Nagle. Set it for latency-sensitive small messages (almost all gateway traffic). The Nagle + delayed-ACK interaction causes ~40ms stalls — a classic war story. You might leave it off for a pure bulk-data path where packet efficiency beats latency.
"How do you preserve the client IP through an L4 proxy?" PROXY protocol (v2 binary) at L4; X-Forwarded-For / Forwarded at L7. Note the trust boundary: only parse it from peers you trust, or a client can spoof their source IP.
"How do you drain a node for a deploy without dropping requests?" Fail readiness probe → LB removes endpoint → stop accept → wait for in-flight with a deadline → force-close stragglers. Tie it to the Kubernetes preStop hook + terminationGracePeriodSeconds (gw-09).
"TIME_WAIT is piling up — what is it and do you care?" It's correct behavior on the active closer (prevents 4-tuple reuse hazards for 2·MSL). You care when ephemeral ports or conntrack exhaust. Fixes: keep-alive (don't close), let the client close, tune tcp_tw_reuse for outbound, widen the ephemeral range. Don't reach for the dangerous tcp_tw_recycle (removed for good reason).

8. Connections to other labs

db-01 (storage primitives) gave you the syscall-level model — pread/pwrite, the page cache, alignment. Here the same rigor applies to sockets: read/write/splice, epoll, kernel buffers.
gw-02 (L7 protocols) terminates the byte stream this lab forwards blindly; that's where you stop being able to splice and start paying to parse.
gw-03 (API gateway) wraps this acceptor + copy loop in a filter chain. The Zuul event-loop model is this lab's acceptor.
gw-04 (connection management) is the L4 optimization layer: don't re-handshake, pool and reuse the connections this lab opens.
gw-05 (WebSockets/Pushy) is this lab's draining problem at its hardest: millions of long-lived connections that can't be dropped.
gw-09 (Kubernetes networking) is where these packets actually flow — veth pairs, CNI, kube-proxy — and where readiness probes gate the drain.

gw-01 — The Hitchhiker's Guide to the L4 Data Plane

A long-form, hands-on companion to CONCEPTS.md. Read with the code in src/go/ open. Everything here is runnable and tested; nothing requires another book.

This guide takes you from "a socket is a file descriptor" to "I can operate a Netty-style L4 proxy at a million connections and explain every kernel counter that moves." It is written the way a maintainer of a proxy would explain it to a new teammate: the why behind each line, the failure that motivated it, and the experiment that proves it.

0. The 60-second map

A Layer-4 proxy is four ideas stacked:

Accept connections off a kernel queue without falling behind.
Copy bytes both ways, coupling the two directions so a slow peer throttles the fast one (backpressure).
Half-close correctly so one direction ending doesn't kill the other.
Drain in-flight connections on shutdown without dropping anyone.

The code in l4/proxy.go is exactly those four ideas and nothing else. By the end you'll understand each at the syscall level and have changed/broken/fixed each one yourself.

1. The kernel model you must carry in your head

When you call net.Listen("tcp", ":8080"), Go performs socket(), bind(), listen(). That listen(fd, backlog) creates two queues the kernel manages for you:

   incoming SYN ───▶ [ SYN queue ]  half-open, awaiting final ACK
                          │  3-way handshake completes
                          ▼
                     [ accept queue ] established, awaiting accept()
                          │  your code calls Accept()
                          ▼
                     a connected socket fd

SYN queue size ≈ net.ipv4.tcp_max_syn_backlog. SYN floods fill it; SYN cookies (net.ipv4.tcp_syncookies) let the kernel keep serving without storing half-open state.
Accept queue size ≈ min(backlog, net.core.somaxconn). This is the one that bites you. If your Accept() loop is slower than the arrival rate, this queue fills, and the kernel either drops the handshake packet (the default; the client retransmits, which looks like latency) or, with net.ipv4.tcp_abort_on_overflow=1, sends RST (client sees "connection reset"). The counter is ListenOverflows in nstat/netstat -s.

Carry this picture everywhere. 80% of "mysterious gateway latency" stories end at "the accept queue was overflowing because something blocked the accept loop."

Why the accept loop must never block

In proxy.go, Serve does the minimum per accepted connection — increment a counter, spawn a goroutine, loop:

conn, err := ln.Accept()
...
p.wg.Add(1)
go func() { defer p.wg.Done(); p.handle(conn) }()

If handle ran inline (no goroutine), the accept loop would stall for the entire lifetime of each connection and the accept queue would overflow instantly under load. The goroutine is not a performance nicety; it's a correctness requirement. In Java/Netty the equivalent is "don't do blocking work on the boss/acceptor event loop."

Go's secret: "two goroutines per connection" looks like the thread-per-connection model that caused the C10K problem. It isn't. Go's netpoller multiplexes all those goroutines over a handful of OS threads using epoll/kqueue; a goroutine blocked in Read parks with a few KB of stack, not a 1 MB OS-thread stack. You write blocking code; the runtime runs an event loop. That's why this lab is readable and scalable.

2. Backpressure: the bug everyone writes first

The naive proxy spawns a reader goroutine that pushes into an unbounded queue and a writer goroutine that drains it:

// THE CLASSIC BUG — do not ship this
ch := make(chan []byte, 1<<30) // "big enough"
go func() { for { b := read(src); ch <- b } }()
go func() { for b := range ch { write(dst, b) } }()

Point a fast producer (yes | nc) at a slow consumer (an origin you kill -STOP) and watch RSS climb until the OOM killer fires. The producer reads as fast as the kernel delivers; the consumer can't keep up; the queue absorbs the difference without bound. A proxy is a bucket brigade, not a warehouse.

The fix is in proxy.go's pipe:

buf := make([]byte, 32*1024) // ONE bounded buffer
for {
    nr, er := src.Read(buf)
    if nr > 0 {
        nw, ew := dst.Write(buf[:nr]) // BLOCKS until the slow side accepts it
        ...
    }
    if er != nil { break }
}

dst.Write blocks when the destination's socket send buffer is full (the kernel won't accept more until the slow peer ACKs and drains its receive window). While Write blocks, this goroutine doesn't call Read again — so we stop pulling from the fast side. Memory is bounded by one 32 KiB buffer per direction, forever, at any throughput. That is backpressure: TCP flow control on the slow side propagates, through our blocking write, into a paused read on the fast side.

io.Copy would give you the same property for free (it uses a 32 KiB buffer internally) — we hand-rolled the loop only so the byte counter updates during the connection, not just at close. Run TestProxyForwards: it asserts BytesUp/BytesDown move while the connection is still open, which is only true because we count per write.
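
For comparison, live counting is also possible without hand-rolling the loop: wrap the destination in a writer that counts as it goes and hand that to io.Copy. This is an alternative sketch, not what proxy.go does; `countingWriter`, `copyCounted`, and `countDemo` are hypothetical names, and wrapping the destination may prevent io.Copy from using an optimized path on some connection types.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
	"sync/atomic"
)

// countingWriter adds every successful write to n as it happens,
// so a stats sampler can read n while the copy is still running.
type countingWriter struct {
	w io.Writer
	n *atomic.Int64
}

func (c countingWriter) Write(p []byte) (int, error) {
	nw, err := c.w.Write(p)
	c.n.Add(int64(nw))
	return nw, err
}

// copyCounted copies src to dst with io.Copy's bounded buffer while
// keeping the counter current. A blocked dst.Write still pauses the read.
func copyCounted(dst io.Writer, src io.Reader, n *atomic.Int64) (int64, error) {
	return io.Copy(countingWriter{w: dst, n: n}, src)
}

// countDemo copies an in-memory payload and returns both byte counts.
func countDemo(payload string) (copied, counted int64) {
	var n atomic.Int64
	var sink bytes.Buffer
	copied, _ = copyCounted(&sink, strings.NewReader(payload), &n)
	return copied, n.Load()
}

func main() {
	copied, counted := countDemo("hello")
	fmt.Println("copied:", copied, "counted:", counted)
}
```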

3. Half-close: the difference between a toy and a proxy

A TCP connection is two independent one-way byte streams, one in each direction. Either side can send FIN (no more data this way) while still receiving. Protocols rely on this: an HTTP client can send its request, half-close, and still read the response.

A toy proxy does defer client.Close(); defer origin.Close() and copies both ways; when the client half-closes, the toy tears down everything, killing the response. Watch pipe do it right:

func pipe(dst, src net.Conn, counter *atomic.Int64) {
    // ... copy src -> dst (the bounded loop from section 2) ...
    if cw, ok := dst.(interface{ CloseWrite() error }); ok {
        cw.CloseWrite() // forward the FIN on THIS direction only
    }
}

When src (say the client) sends FIN, src.Read returns io.EOF, the loop ends, and we call dst.CloseWrite() — propagating the FIN to the origin's read side. The other pipe goroutine (origin→client) is untouched and keeps delivering the response until the origin closes its write side. TestHalfClose proves it: the origin drains to EOF, then writes "DONE", and the client still receives it.

*net.TCPConn satisfies CloseWrite(); the type assertion is how you reach the half-close without giving up the net.Conn interface.
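
To see half-close without a proxy in the way, the sketch below runs a client and a server over loopback. It is an illustration written for this guide rather than a copy of TestHalfClose; the name `halfCloseDemo` and the `PING`/`DONE` payloads are assumptions.

```go
package main

import (
	"fmt"
	"io"
	"net"
)

// halfCloseDemo sends a request, closes only the write side, and then
// still reads the server's reply on the same connection.
func halfCloseDemo() (request, reply string, err error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", "", err
	}
	defer ln.Close()

	got := make(chan string, 1)
	go func() {
		c, err := ln.Accept()
		if err != nil {
			got <- ""
			return
		}
		defer c.Close()
		req, _ := io.ReadAll(c) // returns at EOF, i.e. when the client's FIN arrives
		got <- string(req)
		c.Write([]byte("DONE")) // the server-to-client direction is still open
	}()

	c, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return "", "", err
	}
	defer c.Close()
	if _, err = c.Write([]byte("PING")); err != nil {
		return "", "", err
	}
	cw, ok := c.(interface{ CloseWrite() error })
	if !ok {
		return "", "", fmt.Errorf("%T cannot half-close", c)
	}
	if err = cw.CloseWrite(); err != nil { // send FIN, keep reading
		return "", "", err
	}
	resp, err := io.ReadAll(c)
	if err != nil {
		return "", "", err
	}
	return <-got, string(resp), nil
}

func main() {
	req, reply, err := halfCloseDemo()
	fmt.Println("server saw:", req, "client got:", reply, "err:", err)
}
```

Replace `CloseWrite` with a full `Close` and the client can no longer read the reply, which is the toy-proxy failure in miniature.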

4. Draining: every deploy is a drain

You will redeploy this proxy thousands of times. Each redeploy must not drop in-flight connections. The drain logic in Serve/drain:

ctx cancelled
  └─ ln.Close()           // stop accepting NEW connections
  └─ wait for p.wg         // in-flight handlers finish...
     └─ ...or DrainTimeout fires -> forceCloseAll() the stragglers

TestGracefulDrain holds a connection open, cancels, and asserts (a) Serve returns promptly because the deadline force-closes the straggler, and (b) new dials are refused after shutdown.

The ordering subtlety the test can't show (it needs Kubernetes, gw-09): in production you must fail the readiness probe first, so the load balancer stops sending new connections, then cancel/stop accepting. If you stop accepting before the LB notices, the LB keeps sending connections to a closed port → dropped requests on every deploy. The sequence is always: fail readiness → drain → exit.

The DrainTimeout is a policy: long enough that normal requests finish, short enough that a deploy isn't held hostage by one stuck connection. For an L7 HTTP node, seconds. For a 200k-WebSocket node (gw-05), minutes — and a 30 s default terminationGracePeriodSeconds would SIGKILL 200k live connections. Same code, wildly different timeout.

5. Hands-on lab

Build and run:

cd src/go
go build -o /tmp/l4proxy ./cmd/l4proxy

# An HTTP origin so we can drive real load:
python3 -m http.server 9000 &

# The proxy, printing churn stats every second:
/tmp/l4proxy -listen :8080 -origin 127.0.0.1:9000 -stats 1s &

Experiment A — see throughput and the accept path

# keep-alive load (100 connections, each reused): high rps, ~0 accepts/s
wrk -t4 -c100 -d10s http://127.0.0.1:8080/

# connection-per-request load: every request is a new accept + dial.
# Watch the proxy's "accepted/s" stat explode — THIS is what curbing
# connection churn (gw-04) eliminates.
wrk -t4 -c100 -d10s -H 'Connection: close' http://127.0.0.1:8080/

Experiment B — read the kernel

# accept-queue depth on the listening socket (Recv-Q=current, Send-Q=max):
ss -lnt 'sport = :8080'

# listen-queue overflows + SYN drops (should be ~0 when healthy):
nstat -az 2>/dev/null | grep -Ei 'ListenOverflow|ListenDrop|TCPReqQFull' || \
  netstat -s | grep -Ei 'overflow|listen'

# outbound (origin-side) connections opened — the churn counter:
nstat -az 2>/dev/null | grep -i ActiveOpens

Experiment C — prove the event loop is real

# Go's netpoller IS epoll/kqueue. On Linux:
strace -f -e trace=network,epoll_pwait -p "$(pgrep l4proxy)" 2>&1 | head -40
# On macOS, use dtruss or just trust kqueue. You'll see accept4 + the
# poller syscalls; you will NOT see one thread per connection.

Experiment D — force an accept-queue overflow (the war story)

# Make handlers pile up: STOP the origin so dials hang, lower backlog,
# blast many short connections, and watch ListenOverflows climb.
kill -STOP %1                                  # freeze the python origin
for i in $(seq 1 3000); do (echo | nc -w1 127.0.0.1 8080 &) ; done
nstat -az | grep -i overflow                   # the count rises
kill -CONT %1

The client-side symptom of this is connection timeouts and resets, which look like a network problem but are really "the accept queue overflowed because dials to the origin were hanging." The nstat counters are host-wide, so use ss -lnt to confirm which listener is saturated (its Recv-Q sits pinned near Send-Q). Knowing to look at ListenOverflows is the difference between a 10-minute and a 10-hour incident.
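
If you would rather scrape the counter than eyeball nstat, /proc/net/netstat holds the same numbers as pairs of lines: one with counter names, one with values. The parser below is a sketch under that assumption; `parseNetstat` and `listenOverflows` are hypothetical names, and `sampleNetstat` is a trimmed, made-up input used only so the example runs anywhere.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// sampleNetstat is an illustrative, heavily trimmed input with invented values.
const sampleNetstat = "TcpExt: SyncookiesSent ListenOverflows ListenDrops\n" +
	"TcpExt: 0 17 21\n" +
	"IpExt: InOctets OutOctets\n" +
	"IpExt: 1000 2000\n"

// parseNetstat reads the paired name/value lines for one protocol prefix
// (for example "TcpExt") and returns the counters by name.
func parseNetstat(text, proto string) (map[string]uint64, error) {
	out := make(map[string]uint64)
	var names []string
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		if len(f) == 0 || f[0] != proto+":" {
			continue
		}
		if names == nil { // first line of the pair: counter names
			names = f[1:]
			continue
		}
		if len(f)-1 != len(names) {
			return nil, fmt.Errorf("%s: %d names but %d values", proto, len(names), len(f)-1)
		}
		for i, name := range names {
			v, err := strconv.ParseUint(f[i+1], 10, 64)
			if err != nil {
				return nil, err
			}
			out[name] = v
		}
		names = nil
	}
	return out, sc.Err()
}

// listenOverflows extracts the accept-queue overflow counter.
func listenOverflows(text string) (uint64, error) {
	m, err := parseNetstat(text, "TcpExt")
	if err != nil {
		return 0, err
	}
	return m["ListenOverflows"], nil
}

func main() {
	text := sampleNetstat
	if b, err := os.ReadFile("/proc/net/netstat"); err == nil {
		text = string(b) // real counters when running on Linux
	}
	n, err := listenOverflows(text)
	fmt.Println("ListenOverflows:", n, "err:", err)
}
```

Sample it twice and alert on the delta; the absolute value only tells you the queue has overflowed at some point since boot.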

6. Deep dives (the maintainer details)

6.1 `TCP_NODELAY` and the 40 ms stall

Nagle's algorithm holds small writes to coalesce them into full segments; delayed ACK holds ACKs up to ~40 ms hoping to piggyback on return data. Together, on a small request/response, they stall for ~40 ms: the sender waits for an ACK (Nagle) while the receiver waits for data to piggyback the ACK on (delayed ACK). setNoDelay in proxy.go sets TCP_NODELAY on both sides to disable Nagle. (Go's net package already enables TCP_NODELAY by default on TCP connections, so the explicit call mainly documents intent.) For a request/response proxy this is almost always right; for a bulk-transfer path you might leave Nagle on to reduce packet count. It's a latency/throughput knob, not a "go faster" button.

6.2 Zero-copy: where `splice(2)` would go

Our pipe copies bytes through a userspace buf: read() (kernel→user) then write() (user→kernel), which is two copies and two syscalls per buffer. An L4 proxy that never inspects the payload doesn't need to see the bytes. Linux splice(2) moves them between two fds through a kernel pipe with zero userspace copies; high-end L4 proxies hit line rate this way. We don't use it because (a) it's Linux-only, and Go's stdlib reaches it only implicitly (io.Copy between two *net.TCPConn can take a splice fast path on Linux, which a hand-rolled read/write loop bypasses), and (b) the moment you add TLS termination or inspection (gw-02/gw-07) you can't splice anyway: you must decrypt and parse in userspace. The L4/L7 cost difference is, quite literally, those two memory copies.

6.3 `SO_REUSEPORT`: one acceptor per core

A single accept loop is a single point of contention; at very high connect rates the kernel's accept-queue lock and your one goroutine become the bottleneck. SO_REUSEPORT lets N sockets bind the same port; the kernel hashes incoming connections across them, so you run one acceptor per core with no shared lock. To add it here, set the socket option in a net.ListenConfig{Control: ...} (see steps/02). Caveat: it spreads new connections, not load — with long-lived connections (gw-05) one acceptor can still end up hot.

6.4 TIME_WAIT and who pays for `close()`

The side that actively closes a connection holds it in TIME_WAIT for 2×MSL (~60 s on Linux) so a delayed duplicate segment from the old connection can't corrupt a new one reusing the same 4-tuple. It's correct, but it means the closer pays. A proxy that opens and closes origin connections per request accumulates TIME_WAIT sockets on its own origin-facing side and can exhaust ephemeral ports or conntrack. The fixes, in order of preference: don't close (keep-alive/pooling, gw-04), let the client close, widen ip_local_port_range, and tcp_tw_reuse for outbound. Never reach for the long-removed tcp_tw_recycle.
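
The exhaustion point is simple arithmetic, worked here in Go. The sketch assumes the common Linux default ephemeral range of 32768 to 60999 and the ~60 s TIME_WAIT mentioned above; `maxCloseRate` and `defaultLinuxRate` are hypothetical names, and the result is a rough ceiling per (source IP, destination IP, destination port), not a measured limit.

```go
package main

import (
	"fmt"
	"time"
)

// maxCloseRate returns how many connections per second one source IP can
// open and actively close toward a single destination (IP, port) before
// every ephemeral port in [lo, hi] is parked in TIME_WAIT.
func maxCloseRate(lo, hi int, timeWait time.Duration) float64 {
	ports := hi - lo + 1
	return float64(ports) / timeWait.Seconds()
}

// defaultLinuxRate plugs in the assumed defaults: ports 32768-60999, 60 s.
func defaultLinuxRate() float64 {
	return maxCloseRate(32768, 60999, 60*time.Second)
}

func main() {
	fmt.Printf("sustainable close rate: %.0f conn/s per destination\n", defaultLinuxRate())
}
```

With those defaults that is 28,232 ports over 60 s, or roughly 470 connections per second to one origin address, which a connection-per-request proxy can exceed easily.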

6.5 conntrack: the hidden quota

If your nodes run netfilter/NAT (most Kubernetes setups do — gw-09), every connection consumes a nf_conntrack slot. A connection-heavy proxy can hit nf_conntrack_max and start dropping with cryptic nf_conntrack: table full dmesg lines. Watch net.netfilter.nf_conntrack_count vs _max. The durable fix is fewer connections (gw-04 again); the stopgap is raising the limit.

7. PROXY protocol: keeping the client IP

Put a proxy in front of an origin and the origin's accept() sees the proxy's IP. For L7 you'd add X-Forwarded-For; for L4 (no HTTP to add a header to) the standard is the PROXY protocol — a header prepended to the byte stream before the payload. proxyproto.go implements v1 (text):

PROXY TCP4 203.0.113.7 10.0.0.5 56324 8080\r\n

WriteProxyV1 emits it toward the origin; ParseProxyV1 reads and strips it on the receiving side, leaving the bufio.Reader positioned at the real payload. TestProxyProtocolEndToEnd runs the full loop: a client connects through the proxy, the origin parses the header, and the test asserts the origin recovered the client's actual source port.

The trust boundary (say this in an interview): only parse a PROXY header from peers you trust. If an untrusted client can send one, they spoof their source IP — and now your rate limits, geo rules, and audit logs are lies. Production allowlists the set of IPs permitted to speak PROXY protocol. The exact same caution applies to trusting an inbound X-Forwarded-For at L7 (gw-03/gw-07).
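
The allowlist itself is a few lines with net/netip. In this sketch the names `trustedProxy` and `allowDemo` and the `10.0.0.0/8` prefix are assumptions for illustration; a real deployment would load its prefixes from configuration and run this check before parsing any PROXY header.

```go
package main

import (
	"fmt"
	"net/netip"
)

// trustedProxy reports whether peer falls inside one of the allowlisted
// prefixes. Only such peers should be allowed to send a PROXY header.
func trustedProxy(peer netip.Addr, allow []netip.Prefix) bool {
	peer = peer.Unmap() // treat ::ffff:10.1.2.3 the same as 10.1.2.3
	for _, p := range allow {
		if p.Contains(peer) {
			return true
		}
	}
	return false
}

// allowDemo checks a peer against a hypothetical internal prefix.
func allowDemo(peer string) bool {
	allow := []netip.Prefix{netip.MustParsePrefix("10.0.0.0/8")}
	return trustedProxy(netip.MustParseAddr(peer), allow)
}

func main() {
	fmt.Println("10.1.2.3 trusted:", allowDemo("10.1.2.3"))
	fmt.Println("203.0.113.7 trusted:", allowDemo("203.0.113.7"))
}
```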

8. Code tour (read in this order)

l4/proxy.go — Listen/Serve/handle/pipe/drain. The whole proxy is here; ~180 lines.
l4/proxyproto.go — the PROXY v1 codec.
l4/proxy_test.go — six tests that boot real in-process proxies; read these to see the behaviors asserted.
cmd/l4proxy/main.go — the CLI + signal-driven drain + the churn-stats sampler.

Run it all: bash scripts/verify.sh (vet + build + -race test).

9. Exercises (do these — they're the interview)

Add SO_REUSEPORT (steps/02) and start two instances on :8080. Prove with ss -tnp that both accept connections. Then explain why long-lived connections can still make one hot.
Add an idle timeout: set a read deadline that resets on activity; close connections idle past N seconds. Watch it reap leaked connections. Why is too-aggressive a timeout its own bug (gw-04)?
Break backpressure on purpose: replace pipe with the unbounded-channel version and reproduce unbounded RSS growth under a kill -STOP'd origin. Revert. Now you've felt the bug.
Make drain force-close visible: add a log line in forceCloseAll and a metric for "stragglers force-closed at deadline." When would a nonzero value be alarming?
Pick a backend per connection: extend OriginAddr to a list and choose one per accepted connection (round-robin). Notice you can only balance connections, not requests — the fundamental L4 limit that motivates L7 (gw-02) and P2C (gw-06).

When you can do all five and narrate the kernel counter each one moves, you can hold the L4 portion of a Cloud Gateway interview.

gw-01 — References

Protocol specs

RFC 9293 — Transmission Control Protocol (the consolidated TCP spec; supersedes 793). The state machine and the handshake.
RFC 768 — User Datagram Protocol. Short enough to read in full.
RFC 7413 — TCP Fast Open (handshake-latency optimization).
HAProxy PROXY protocol — v1 (text) and v2 (binary) spec. https://www.haproxy.org/download/2.8/doc/proxy-protocol.txt

The C10K lineage

Dan Kegel, The C10K problem — the original statement of why thread-per-connection doesn't scale. http://www.kegel.com/c10k.html
Zuul 2: The Netflix Journey to Asynchronous, Non-Blocking Systems — the production answer at Netflix; event loop vs thread-per-request. https://netflixtechblog.com/zuul-2-the-netflix-journey-to-asynchronous-non-blocking-systems-45947377fb5c
Netty docs / Netty in Action — the event-loop framework behind Zuul and Pushy. https://netty.io/

Kernel & performance

Brendan Gregg, Systems Performance (2nd ed.) — the network stack, the USE method, epoll/kqueue, tcpdump/ss/bpftrace.
man 7 tcp, man 7 socket, man 2 epoll, man 2 splice, man 2 accept4 — read these directly; they are the spec for the code.
Cloudflare blog — The curious case of TIME_WAIT, SYN packet handling in the wild, How to stop running out of ephemeral ports. https://blog.cloudflare.com/
Marc Brooker — It's About Time and the timeouts/retries series for why connection-level timeouts must be bounded. https://brooker.co.za/blog/

Tooling to learn alongside

ss -lnti / ss -s — listening-socket accept-queue depth, per-socket TCP info, summary counters.
netstat -s / nstat — kernel TCP counters (listen overflows, SYN drops, retransmits).
tcpdump / Wireshark — see the handshake and FIN/RST on the wire.
iperf3, wrk, wrk2, vegeta — load generators for the steps.
bpftrace / bcc — trace accept-queue overflows and retransmits.

Cross-lab dependencies

Upstream: db-01 (syscalls, page cache) for the I/O intuition.
Downstream: gw-02 (terminate the stream), gw-03 (filter chain on top of the acceptor), gw-04 (pool the connections this lab opens), gw-05 (drain millions of them), gw-09 (where the packets actually flow in Kubernetes).

gw-01 — Analysis

A design-review-style look at the L4 proxy you build in steps/, and the trade-offs a senior engineer is expected to articulate.

Required behaviors

Non-blocking acceptance. The acceptor must keep draining the accept queue under load; a slow accept loop manifests as ListenOverflows in nstat and as connection resets at the client.
Coupled backpressure. Memory must stay bounded as throughput rises. If the origin stalls, the proxy must stop reading the client, not buffer without limit.
Symmetric half-close. When one side sends FIN, forward it (CloseWrite on the peer) but keep the other direction open until it too closes. A proxy that tears down both directions on the first FIN breaks request/response protocols that close one way early.
Bounded connection lifetime. Idle timeout, and a hard cap so a half-dead peer can't pin a connection forever (TCP_USER_TIMEOUT / deadlines).
Graceful drain. On SIGTERM: fail readiness, stop accepting once the LB has stopped sending, wait for in-flight with a deadline, force-close stragglers, log them.
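
The idle-timeout half of "bounded connection lifetime" can be sketched as a thin wrapper around net.Conn. This is one possible shape, not the lab's implementation: `idleConn` and `idleDemo` are hypothetical names and the 50 ms timeout exists only to keep the demo fast.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// idleConn pushes the read deadline forward before every Read, so the
// connection times out only after `idle` with no inbound bytes.
type idleConn struct {
	net.Conn
	idle time.Duration
}

func (c idleConn) Read(p []byte) (int, error) {
	if err := c.Conn.SetReadDeadline(time.Now().Add(c.idle)); err != nil {
		return 0, err
	}
	return c.Conn.Read(p)
}

// idleDemo reads from a peer that never sends and reports whether the
// read ended with a timeout error.
func idleDemo() bool {
	a, b := net.Pipe() // in-memory connection that supports deadlines
	defer a.Close()
	defer b.Close()
	c := idleConn{Conn: a, idle: 50 * time.Millisecond}
	_, err := c.Read(make([]byte, 1))
	var ne net.Error
	return errors.As(err, &ne) && ne.Timeout()
}

func main() {
	fmt.Println("silent peer timed out:", idleDemo())
}
```

Because the deadline resets on every read, a peer that trickles one byte at a time stays alive indefinitely; the hard cap needs a separate absolute deadline or TCP_USER_TIMEOUT.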

Design decisions

Go goroutines over a hand-rolled epoll loop. Go's netpoller is an epoll/kqueue event loop; "two goroutines per connection" compiles down to event-driven I/O on a small thread pool. We use it because it's idiomatic and lets the lab focus on proxy semantics, then docs/observation.md shows the epoll calls underneath with strace so you see what's really happening. In Java you'd write this on Netty with an explicit EventLoopGroup; the semantics are identical.
io.Copy for the data path, deliberately. io.Copy(dst, src) blocks the goroutine on a slow dst.Write, which stops src.Read — that is backpressure, for free, with a bounded internal buffer. The step then shows the failure mode of "fixing" this with an unbounded channel between read and write goroutines (the classic OOM bug).
Half-close via CloseWrite. We type-assert to interface{ CloseWrite() error } (satisfied by *net.TCPConn) so a FIN on one side propagates as a FIN on the other without killing the reverse direction. This is the difference between a toy proxy and one that handles real protocols.
PROXY protocol v1 prepend. Simple, text, human-readable in tcpdump. v2 (binary) is what you'd use in production for efficiency and for carrying TLVs (e.g., the TLS SNI), but v1 makes the concept legible in a lab.
Drain via context + WaitGroup. A context.Context cancels the acceptor; a sync.WaitGroup tracks in-flight connections; a goroutine closes a done channel once wg.Wait() returns, and a select on that channel vs a timer implements wait-with-deadline.
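
The io.Copy backpressure claim above can be checked without sockets. The sketch below uses two hypothetical types (countingReader, trackingWriter) that deliberately do not implement WriterTo/ReaderFrom, so io.Copy takes its generic buffered path; it records the largest gap between bytes read and bytes written.

```go
package main

import (
	"fmt"
	"io"
)

// countingReader hands out `remaining` bytes and records the running total
// in *read. It is a hypothetical stand-in for the client socket.
type countingReader struct {
	remaining int
	read      *int
}

func (r *countingReader) Read(p []byte) (int, error) {
	if r.remaining == 0 {
		return 0, io.EOF
	}
	n := len(p)
	if n > r.remaining {
		n = r.remaining
	}
	r.remaining -= n
	*r.read += n
	return n, nil
}

// trackingWriter stands in for the origin socket. On every Write it records
// the peak number of bytes that were read but not yet written.
type trackingWriter struct {
	read        *int
	written     int
	maxInFlight int
}

func (w *trackingWriter) Write(p []byte) (int, error) {
	if gap := *w.read - w.written; gap > w.maxInFlight {
		w.maxInFlight = gap
	}
	w.written += len(p)
	return len(p), nil
}

// measure copies total bytes through io.Copy and returns the byte count and
// the peak in-flight gap.
func measure(total int) (int64, int) {
	read := 0
	src := &countingReader{remaining: total, read: &read}
	dst := &trackingWriter{read: &read}
	n, _ := io.Copy(dst, src)
	return n, dst.maxInFlight
}

func main() {
	n, peak := measure(1 << 20)
	fmt.Printf("copied=%d peak in-flight=%d\n", n, peak)
}
```

Because io.Copy alternates one Read with one Write, the gap can never exceed the copy buffer, however slow the writer is. An unbounded queue between the two halves removes exactly that coupling.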

Tradeoffs worth flagging

No zero-copy. The Go lab copies through userspace. Production L4 proxies use splice(2) to avoid it; we don't, because the moment you add TLS termination or inspection (gw-02/gw-07) you can't splice anyway, and the lab's next step is exactly that. We call out where splice would go and what it would save (roughly halves syscalls and removes two memory copies per buffer).
Per-connection goroutines, not a fixed event-loop pool. At Netflix-Pushy scale (hundreds of thousands of connections per node, gw-05) the goroutine stacks add up: each starts at 2 KiB and grows on demand, and this design runs three goroutines per connection (the handler plus one per direction). You'd profile and possibly move to a fixed-pool model or tune GOMAXPROCS. For an L4 proxy of moderate fan-in it's fine and far more readable.
L4 can't retry. Once you've forwarded bytes to an origin and it dies mid-stream, L4 cannot transparently retry — the application data is already gone and order matters. Retries are an L7 capability (gw-06). Saying this out loud in a design review is a senior signal.
Connection-level stickiness. An L4 LB pins a whole connection to one backend. With HTTP/1.1 keep-alive that means many requests ride one backend (uneven load); with HTTP/2 it means all multiplexed streams pin to one backend (worse). The fix is L7 LB or h2-aware L4 (gw-02, gw-06).
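
The stickiness trade-off is easy to see with a toy model. The sketch below is an assumption-laden simulation (round-robin placement, identical connections, hypothetical function names), not a measurement of any real load balancer.

```go
package main

import "fmt"

// perConnection models an L4 balancer: each connection is placed round-robin
// and every request it carries lands on that one backend.
func perConnection(conns, reqsPerConn, backends int) []int {
	load := make([]int, backends)
	for c := 0; c < conns; c++ {
		load[c%backends] += reqsPerConn
	}
	return load
}

// perRequest models an L7 balancer: every request is placed round-robin,
// regardless of which connection carried it.
func perRequest(conns, reqsPerConn, backends int) []int {
	load := make([]int, backends)
	for i := 0; i < conns*reqsPerConn; i++ {
		load[i%backends]++
	}
	return load
}

func main() {
	// 4 long-lived connections, 100 requests (or h2 streams) each, 3 backends.
	fmt.Println("L4:", perConnection(4, 100, 3)) // [200 100 100]
	fmt.Println("L7:", perRequest(4, 100, 3))    // [134 133 133]
}
```

With few, busy connections the L4 split is lumpy: one backend takes twice the load of the others. Real traffic is messier than round-robin, but the direction of the effect is the same.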

What "production-quality" adds beyond this lab

splice/sendfile zero-copy for the inspect-free path.
SO_REUSEPORT with one acceptor per core to remove accept-lock contention and let the kernel spread accepts.
Connection and request rate limiting and concurrency caps to survive a thundering herd (gw-06).
conntrack/ephemeral-port budgeting and TIME_WAIT management for the outbound (origin) side.
A real health/readiness model integrated with the LB and Kubernetes endpoint lifecycle (gw-09), so drain actually removes you from rotation before you stop accepting.
Observability: per-connection bytes, accept-queue depth, active connection gauge, drain duration histogram (gw-11).
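
As one illustration of the "concurrency caps" item, a buffered channel works as a counting semaphore. connLimiter is a hypothetical name; the lab's own limiter (gw-06) may look quite different.

```go
package main

import "fmt"

// connLimiter is a hypothetical cap on concurrently served connections,
// built on a buffered channel used as a counting semaphore.
type connLimiter struct{ slots chan struct{} }

func newConnLimiter(max int) *connLimiter {
	return &connLimiter{slots: make(chan struct{}, max)}
}

// tryAcquire reserves a slot without blocking; false means "at capacity".
func (l *connLimiter) tryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a slot previously reserved with tryAcquire.
func (l *connLimiter) release() { <-l.slots }

func main() {
	lim := newConnLimiter(2)
	fmt.Println(lim.tryAcquire(), lim.tryAcquire(), lim.tryAcquire()) // true true false
	lim.release()
	fmt.Println(lim.tryAcquire()) // true
}
```

In an accept loop, a false return would typically mean closing the new connection immediately (shedding load) rather than queueing it, so that overload stays visible to clients instead of turning into unbounded memory growth.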

gw-01 — Execution

Prerequisites

Go ≥ 1.25 (stdlib only; no modules to download — works offline).
Optional for the hands-on lab: wrk/wrk2, ss, nstat/netstat, python3 (for a throwaway origin), nc.

One-shot: prove the lab works

cd gw-01-l4-data-plane
bash scripts/verify.sh        # go vet + go build + go test -race

A green run ends with:

=== gw-01 OK ===

Per-language workflow (Go)

cd gw-01-l4-data-plane/src/go
go test -race -count=1 ./...      # 6 tests in package l4
go build -o /tmp/l4proxy ./cmd/l4proxy

Run the proxy

# origin:
python3 -m http.server 9000 &
# proxy (with per-second churn stats):
/tmp/l4proxy -listen :8080 -origin 127.0.0.1:9000 -stats 1s

CLI flags

flag	default	meaning
`-listen`	`:8080`	bind address
`-origin`	`127.0.0.1:9000`	upstream host:port
`-proxy-header`	false	emit PROXY-protocol v1 to the origin
`-drain`	`25s`	graceful-drain timeout on SIGINT/SIGTERM
`-stats`	`0`	if >0, print connection/churn stats each interval

Try it

# forward a line through the proxy:
printf 'GET / HTTP/1.0\r\n\r\n' | nc 127.0.0.1 8080 | head

# load with vs without keep-alive (watch accepted/s in the proxy log):
wrk -t4 -c100 -d10s http://127.0.0.1:8080/
wrk -t4 -c100 -d10s -H 'Connection: close' http://127.0.0.1:8080/

# graceful drain: start a slow connection, then Ctrl-C the proxy and
# watch it wait for in-flight before exiting.

See GUIDE.md §5 for the full hands-on lab and the kernel counters to read.

gw-01 — Verification

One command

cd gw-01-l4-data-plane && bash scripts/verify.sh

Runs go vet, go build, and go test -race. Exits 0 and prints === gw-01 OK === on success.

What the tests prove

Test	Invariant
`TestProxyForwards`	bidirectional forwarding works; `Accepted` and byte counters move during the connection (proves per-write counting / live metrics)
`TestParseProxyV1`	PROXY v1 header parses; the `bufio.Reader` still holds the payload after the header
`TestParseProxyV1Bad`	malformed PROXY headers (wrong verb, missing fields, bad family, bad IP, bad port) are rejected, not crashed on
`TestProxyProtocolEndToEnd`	end-to-end: origin recovers the real client source port from the emitted PROXY header
`TestHalfClose`	a client FIN propagates one direction only; the origin's reply still reaches the client (no premature teardown)
`TestGracefulDrain`	on cancel, in-flight is drained (force-closed at the deadline) and new dials are refused

All run under -race, so the connection-tracking map, WaitGroup, and atomic counters are checked for data races.

What "green" does NOT guarantee

No zero-copy. The data path copies through userspace; production would splice(2) the inspect-free path (GUIDE.md §6.2).
No real load characterization. Tests verify behavior, not throughput; use the GUIDE.md §5 lab with wrk for that.
No multi-origin LB. One origin only; connection-level balancing is an exercise (GUIDE.md §9) and the L4 limit that motivates gw-02/gw-06.
No Kubernetes drain ordering. The readiness-before-stop sequence needs a cluster (gw-09); the unit test only covers the in-process drain/force-close.

Manual checks worth doing

# force an accept-queue overflow and see ListenOverflows climb (GUIDE §5D)
# watch outbound connection churn with vs without keep-alive (GUIDE §5A)
# strace the process to confirm epoll/kqueue (no thread-per-conn) (GUIDE §5C)

gw-01 step 01 — A TCP proxy with backpressure and half-close

Goal

Build the smallest correct L4 TCP proxy: accept a client connection, dial an origin, copy bytes both ways with backpressure, and handle half-close so a FIN in one direction doesn't kill the other. This is the core of every proxy in the rest of the phase.

Code

src/go/proxy.go:

package main

import (
	"context"
	"io"
	"log"
	"net"
	"sync"
	"time"
)

// Proxy is a minimal L4 TCP proxy: it forwards every accepted client
// connection to a single origin address.
type Proxy struct {
	ListenAddr string
	OriginAddr string

	wg sync.WaitGroup // tracks in-flight connections (used for draining in step 03)
}

func (p *Proxy) Run(ctx context.Context) error {
	ln, err := net.Listen("tcp", p.ListenAddr)
	if err != nil {
		return err
	}
	log.Printf("listening on %s -> %s", p.ListenAddr, p.OriginAddr)

	// Close the listener when ctx is cancelled so Accept() unblocks.
	go func() { <-ctx.Done(); ln.Close() }()

	for {
		client, err := ln.Accept()
		if err != nil {
			select {
			case <-ctx.Done():
				return nil // clean shutdown
			default:
				log.Printf("accept error: %v", err)
				continue
			}
		}
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			p.handle(client)
		}()
	}
}

func (p *Proxy) handle(client net.Conn) {
	defer client.Close()

	// Dial the origin with a bounded timeout — never block forever.
	origin, err := net.DialTimeout("tcp", p.OriginAddr, 2*time.Second)
	if err != nil {
		log.Printf("dial origin: %v", err)
		return
	}
	defer origin.Close()

	// TCP_NODELAY on both sides: disable Nagle so small request/response
	// messages aren't delayed waiting to coalesce (the 40ms-stall trap).
	setNoDelay(client)
	setNoDelay(origin)

	// Two coupled copies. io.Copy blocks on a slow Write, which stops the
	// corresponding Read: that IS backpressure, with a bounded buffer.
	var once sync.WaitGroup
	once.Add(2)
	go func() { defer once.Done(); pipe(origin, client) }() // client -> origin
	go func() { defer once.Done(); pipe(client, origin) }() // origin -> client
	once.Wait()
}

// pipe copies src -> dst, then half-closes dst's write side so the peer
// sees a FIN while the reverse direction stays open.
func pipe(dst, src net.Conn) {
	// Generic path: a bounded 32KiB buffer. On Linux, *net.TCPConn's ReadFrom
	// may use splice(2) instead. Either way a blocked Write stops the Read.
	io.Copy(dst, src)
	if cw, ok := dst.(interface{ CloseWrite() error }); ok {
		cw.CloseWrite() // send FIN on dst, keep reverse direction alive
	}
}

func setNoDelay(c net.Conn) {
	if tc, ok := c.(*net.TCPConn); ok {
		tc.SetNoDelay(true)
	}
}

src/go/main.go:

package main

import (
	"context"
	"flag"
	"log"
	"os/signal"
	"syscall"
)

func main() {
	listen := flag.String("listen", ":8080", "listen address")
	origin := flag.String("origin", "127.0.0.1:9000", "origin address")
	flag.Parse()

	ctx, stop := signal.NotifyContext(context.Background(),
		syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	p := &Proxy{ListenAddr: *listen, OriginAddr: *origin}
	if err := p.Run(ctx); err != nil {
		// Report why startup failed (e.g. port in use) instead of exiting silently.
		log.Fatalf("proxy: %v", err)
	}
}

Run it

cd gw-01-l4-data-plane/src/go
go mod init gw01 && go build -o /tmp/l4proxy .

# Terminal 1: a trivial origin (echoes uppercased lines, say) or just nc
nc -l 9000

# Terminal 2: the proxy
/tmp/l4proxy -listen :8080 -origin 127.0.0.1:9000

# Terminal 3: a client
nc 127.0.0.1 8080
# type a line; it shows up in terminal 1. Type in terminal 1; it shows
# up in terminal 3. Ctrl-D in one direction half-closes that direction
# only — the other still works.

Tasks

Implement Proxy.Run and handle as above; confirm bidirectional forwarding with nc.
Prove half-close works: in the client nc, press Ctrl-D. The origin should see EOF (FIN) but you can still type in the origin and see it at the client. A broken proxy kills both directions here.
Prove backpressure (the anti-pattern). Replace io.Copy with a version that reads into an unbounded chan []byte and writes from a second goroutine. Point a fast producer (yes | nc) at a slow consumer (an origin you kill -STOP). Watch RSS climb without bound — that's the warehouse-not-brigade bug. Revert to io.Copy.

Acceptance

Bidirectional copy works between two nc sessions through the proxy.
Ctrl-D on one side half-closes only that direction.
With io.Copy, proxy RSS stays flat under a fast-producer/slow-consumer load; with the unbounded-channel version it grows. You can explain why.

Discussion prompts

Where exactly is the bounded buffer in io.Copy, and how big is it? (Hint: io.Copy uses a 32KiB internal buffer unless src/dst implement WriterTo/ReaderFrom.)
Why dial the origin per connection here, and why is that the exact thing gw-04 fixes with pooling?
This proxy can't retry a failed origin mid-stream. Why is that a fundamental L4 limitation rather than a missing feature?

gw-01 step 02 — PROXY protocol and socket tuning

Goal

Preserve the real client IP across the proxy with the PROXY protocol v1, and set the socket options that matter for a gateway. After this step the origin can log the true client address instead of the proxy's.

Code — emit PROXY protocol v1 toward the origin

Prepend a single header line before forwarding the first byte:

import (
	"fmt"
	"net"
)

// writeProxyV1 emits the HAProxy PROXY protocol v1 header describing the
// real client and the local (proxy) address, e.g.:
//   PROXY TCP4 203.0.113.7 10.0.0.5 56324 8080\r\n
func writeProxyV1(origin net.Conn, client net.Conn) error {
	// Checked assertions: a non-TCP conn (e.g. a unix socket) must not panic.
	src, ok1 := client.RemoteAddr().(*net.TCPAddr)
	dst, ok2 := client.LocalAddr().(*net.TCPAddr)
	if !ok1 || !ok2 {
		return fmt.Errorf("proxy v1: non-TCP client %v", client.RemoteAddr())
	}

	fam := "TCP4"
	if src.IP.To4() == nil {
		fam = "TCP6"
	}
	header := fmt.Sprintf("PROXY %s %s %s %d %d\r\n",
		fam, src.IP.String(), dst.IP.String(), src.Port, dst.Port)
	_, err := origin.Write([]byte(header))
	return err
}

Call it once, right after dialing the origin and before the copy loop:

	if p.SendProxyHeader {
		if err := writeProxyV1(origin, client); err != nil {
			log.Printf("proxy header: %v", err)
			return
		}
	}

Code — parse PROXY protocol v1 on the receiving side

The origin (or a downstream proxy) reads and strips the header before treating the rest as payload:

import (
	"bufio"
	"errors"
	"net"
	"strconv"
	"strings"
)

type ClientInfo struct {
	SrcIP   net.IP
	SrcPort int
}

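// readProxyV1 consumes a PROXY v1 header from br and returns the real
// client address. br must be used for all subsequent reads (it may hold
// buffered payload bytes).
func readProxyV1(br *bufio.Reader) (ClientInfo, error) {
	line, err := br.ReadString('\n')
	if err != nil {
		return ClientInfo{}, err
	}
	line = strings.TrimRight(line, "\r\n")
	f := strings.Split(line, " ")
	// f = ["PROXY","TCP4","203.0.113.7","10.0.0.5","56324","8080"]
	if len(f) != 6 || f[0] != "PROXY" {
		return ClientInfo{}, errors.New("bad PROXY v1 header")
	}
	if f[1] != "TCP4" && f[1] != "TCP6" {
		return ClientInfo{}, errors.New("bad PROXY v1 family")
	}
	// Reject bad IPs and ports instead of silently returning nil / 0.
	ip := net.ParseIP(f[2])
	if ip == nil {
		return ClientInfo{}, errors.New("bad PROXY v1 source IP")
	}
	port, err := strconv.Atoi(f[4])
	if err != nil || port < 0 || port > 65535 {
		return ClientInfo{}, errors.New("bad PROXY v1 source port")
	}
	return ClientInfo{SrcIP: ip, SrcPort: port}, nil
}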

Trust boundary (say this in an interview): only parse the PROXY header from peers you trust. If an untrusted client can send it, they can spoof their source IP. In production you allowlist the proxy's IPs as trusted PROXY-protocol senders.

Socket options — what to set and why

Set this	How (Go)	Why on a gateway
`TCP_NODELAY`	`tc.SetNoDelay(true)`	disable Nagle; avoid 40ms small-message stalls (Go already enables this by default on TCP connections; setting it documents intent)
`SO_KEEPALIVE`	`tc.SetKeepAlive(true)` + `tc.SetKeepAlivePeriod(30*time.Second)`	detect dead peers on idle/long-lived conns (vital for gw-04/gw-05)
`SO_REUSEPORT`	via `net.ListenConfig{Control: setReusePort}` (syscall)	run one acceptor per core, no accept-lock contention
read/write deadlines	`c.SetDeadline(time.Now().Add(idle))`, re-armed after each successful read/write	bound idle and half-dead connections

SO_REUSEPORT control hook:

import (
	"context"
	"syscall"
	"golang.org/x/sys/unix"
)

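func setReusePort(network, address string, c syscall.RawConn) error {
	var sockErr error
	// Control reports errors reaching the fd; sockErr carries the setsockopt result.
	if err := c.Control(func(fd uintptr) {
		sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
	}); err != nil {
		return err
	}
	return sockErr
}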

// usage:
lc := net.ListenConfig{Control: setReusePort}
ln, err := lc.Listen(context.Background(), "tcp", p.ListenAddr)

Now you can start N copies of the proxy on the same port; the kernel spreads new connections across them.

Tasks

Add -proxy-header flag; when set, emit PROXY v1 toward the origin.
Write a tiny test origin that calls readProxyV1 and logs the real client IP. Confirm it sees the client's address, not the proxy's.
Enable keep-alive and SO_REUSEPORT. Start two proxy instances on :8080; hammer with wrk and confirm both accept connections (ss -tnp | grep 8080).

Acceptance

Origin logs the genuine client IP/port via the PROXY header.
A malformed PROXY header is rejected (your parser returns an error, doesn't crash).
Two SO_REUSEPORT instances share :8080; load is spread across both.

Discussion prompts

Why is PROXY protocol the L4 answer but X-Forwarded-For the L7 answer to the same problem? What can't you do at L4?
v2 is binary and carries TLVs (e.g. the negotiated TLS SNI). When would the extra complexity of v2 pay off over v1?
SO_REUSEPORT spreads new connections, not load. With long-lived connections (gw-05) one instance can still end up hot. Why, and what do you do about it?

gw-01 step 03 — Graceful drain, load testing, and reading the kernel

Goal

Make the proxy drain on SIGTERM without dropping in-flight connections, then load-test it and read the kernel counters that prove it's healthy. This is the operational core of the role: every deploy is a drain.

Code — drain with a deadline

import (
	"context"
	"log"
	"time"
)

// RunWithDrain runs the proxy and, on ctx cancellation (SIGTERM),
// stops accepting, then waits up to drainTimeout for in-flight
// connections to finish before returning.
func (p *Proxy) RunWithDrain(ctx context.Context, drainTimeout time.Duration) error {
	// Run closes the listener on ctx.Done and returns once accepting has
	// stopped (ctx cancelled) or the listener failed to start.
	err := p.Run(ctx)

	// Now drain: wait for in-flight handlers tracked by p.wg.
	done := make(chan struct{})
	go func() { p.wg.Wait(); close(done) }()

	select {
	case <-done:
		log.Printf("drain complete: all connections closed")
	case <-time.After(drainTimeout):
		log.Printf("drain timeout after %s: forcing exit with stragglers", drainTimeout)
	}
	return err
}

Wire the ordering correctly in main:

	// 1. SIGTERM cancels ctx -> Run stops accepting + closes listener.
	// 2. (Kubernetes preStop already flipped readiness; the LB has
	//    stopped sending NEW connections — see gw-09.)
	// 3. RunWithDrain waits for in-flight, bounded by drainTimeout.
	_ = p.RunWithDrain(ctx, 25*time.Second)

Ordering is everything. Readiness must fail before you stop accepting, so the load balancer removes you from rotation while you can still serve the connections you already have. In Kubernetes the preStop hook + terminationGracePeriodSeconds give you that window (gw-09). Exit too early and every deploy drops requests.

Load test

# Build, run an HTTP origin so wrk can talk through the L4 proxy:
python3 -m http.server 9000 &           # origin
/tmp/l4proxy -listen :8080 -origin 127.0.0.1:9000 &

# Throughput + latency:
wrk -t4 -c200 -d30s http://127.0.0.1:8080/

# Constant-rate (wrk2) to see real tail latency under a fixed load:
wrk2 -t4 -c200 -d30s -R20000 http://127.0.0.1:8080/

Read the kernel — is it healthy?

# Accept-queue depth on the LISTEN socket: Recv-Q = current, Send-Q = max.
# If Recv-Q rides near Send-Q, your accept loop can't keep up.
ss -lnt 'sport = :8080'

# Listen-queue overflows + SYN drops (these should stay ~0):
nstat -az | grep -Ei 'ListenOverflows|ListenDrops|TCPReqQFullDrop'

# Per-socket TCP info (rtt, cwnd, retrans) for established conns:
ss -tni 'sport = :8080' | head

# Are we accumulating TIME_WAIT on the origin (outbound) side?
ss -tan state time-wait | wc -l

# Watch what the proxy is actually doing at the syscall level: you'll
# see epoll_pwait + accept4 + read/write; Go's netpoller IS an event loop:
strace -f -e trace=network,epoll_pwait,read,write -p "$(pgrep l4proxy)" 2>&1 | head -40

Tasks

Implement RunWithDrain. Start a long-lived connection (nc), then kill -TERM the proxy. Confirm the existing connection keeps working until it closes (or the deadline), while a new nc connection is refused immediately.
Drive wrk2 at a fixed rate and record p50/p99/p99.9. Then add a 2nd SO_REUSEPORT instance and show the tail improves (less accept-loop contention).
Force an accept-queue overflow: kill -STOP the origin so handlers pile up, lower the listen backlog, blast wrk -c2000, and watch ListenOverflows climb in nstat. Explain the client-side symptom (connection timeouts / resets).

Acceptance

On SIGTERM, in-flight connections complete (or hit the deadline); new connections are refused. No mid-stream drops in the wrk run that straddles the signal.
You can point at ss/nstat output and say whether the proxy is accept-queue-bound, origin-bound, or healthy.
You can reproduce a ListenOverflows spike on demand and explain it.

Discussion prompts

Why must readiness fail before you stop accepting, not after? Draw the timeline of LB endpoint removal vs your shutdown.
A 25s drain deadline force-closes stragglers. For an L7 HTTP node that's usually fine; for a WebSocket node holding 200k connections (gw-05) it's catastrophic. What changes about drain at that scale?
wrk opens a fixed connection pool, so it never stresses your accept path after warmup. What load shape does stress accept, and how would you generate it? (Hint: many short connections — the very thing gw-04 eliminates with keep-alive.)

gw-02 — L7 Protocols: HTTP/1.1, HTTP/2, gRPC, HTTP/3

Where gw-01 forwards an opaque byte stream, this lab terminates the application protocol: it parses HTTP, understands requests and responses, and can therefore route, retry, rate-limit, and rewrite per request instead of per connection. That is the entire reason an L7 gateway exists. The JD asks for "deep expertise in L7 (HTTP/S, gRPC, WebSockets)" — this lab covers HTTP/1.1, HTTP/2, gRPC, and HTTP/3; WebSockets get their own lab (gw-05) because the persistent-connection proxy is a distinct beast at Netflix (Pushy).

You will hand-write an HTTP/2 frame parser and demonstrate stream multiplexing, HPACK, and flow control — the three things that make h2 fundamentally different from h1 and that every gateway engineer must be able to reason about.

1. What is it?

Layer 7 is the application layer: the protocol carries semantic messages (an HTTP request with a method, path, headers, body), not just bytes. An L7 proxy decodes those messages, which unlocks:

per-request routing (path/header/method-based), not per-connection;
retries and hedging (you know where a request starts and ends);
header manipulation, auth, rate limiting (you can read/modify the message);
observability per request (status code, latency, route).

The cost is real: you must parse, you hold message state, you can't zero-copy splice, and you inherit every protocol's quirks. The four protocols you must know:

HTTP/1.1   text, one request per connection at a time (pipelining is dead)
HTTP/2     binary frames, many streams multiplexed over one TCP conn
gRPC       RPC framing on top of HTTP/2 (length-prefixed protobuf messages)
HTTP/3     HTTP semantics over QUIC over UDP — fixes TCP head-of-line blocking

2. Why does it matter?

The gateway's value lives at L7. Routing 1M+ rps to ~100 backend clusters (the Zuul number) is a per-request decision. Retries, circuit breaking, and header-based canary routing (gw-06, gw-12) are all L7. If you can't parse the protocol you can't do the job.
HTTP/2 multiplexing breaks naive load balancing. With h1, one connection carries one request at a time, so an L4 LB spreading connections roughly spreads load. With h2, one long-lived connection carries hundreds of concurrent streams, so an L4 LB pins all of them to one backend. This is why gRPC (which is h2) needs L7 / per-request LB, and it's a favorite interview trap.
gRPC is the internal-RPC default at cloud-native shops. A gateway that proxies gRPC must understand h2 framing, the grpc-status trailer, and the difference between transport errors and gRPC status codes. Many "my gRPC load balancing is broken" incidents are really "an h1-era L4 LB in the path."
HTTP/3 / QUIC is the edge frontier. It moves the transport into userspace over UDP to kill TCP-level head-of-line blocking and speed up connection setup (0-RTT). The edge team will care because it changes everything about connection management, NAT, and load balancing (UDP flows, connection IDs that survive IP changes).

3. How does it work?

HTTP/1.1 — text, and the framing problem

A request is text headers + optional body. The only hard part is knowing where the body ends:

POST /v1/play HTTP/1.1\r\n
Host: api.netflix.com\r\n
Content-Length: 27\r\n          ← body length known up front
\r\n
{"title":"Stranger Things"}

…or, when length isn't known up front, chunked transfer encoding:

Transfer-Encoding: chunked\r\n
\r\n
1b\r\n                          ← next chunk is 0x1b = 27 bytes
{"title":"Stranger Things"}\r\n
0\r\n                           ← zero-length chunk = end
\r\n

Two ways to delimit a body → the source of request smuggling vulnerabilities when a front proxy and a back proxy disagree about which one wins (CL.TE / TE.CL). A gateway must normalize this. Pipelining (multiple requests in flight on one h1 connection) exists in the spec but is effectively dead because of head-of-line blocking; real h1 reuse is serial keep-alive (one at a time, connection stays open — the thing gw-04 pools).

HTTP/2 — binary frames over one connection

h2 multiplexes many logical streams over one TCP connection using binary frames. Every frame shares a 9-byte header:

+-----------------------------------------------+
|                 Length (24)                   |
+---------------+---------------+---------------+
|   Type (8)    |   Flags (8)   |
+-+-------------+---------------+-------------------------------+
|R|                 Stream Identifier (31)                      |
+=+=============================================================+
|                   Frame Payload (Length bytes)              ...
+---------------------------------------------------------------+

Frame types you must know: HEADERS (start a request/response, carries HPACK-compressed headers), CONTINUATION (the rest of a header block too large for one HEADERS frame), DATA (body), SETTINGS (connection config exchanged at startup), WINDOW_UPDATE (flow control credit), RST_STREAM (cancel one stream), GOAWAY (graceful connection shutdown, critical for draining; gw-01/gw-04), PING, PRIORITY (the RFC 7540 priority scheme, deprecated by RFC 9113), PUSH_PROMISE (server push, now effectively deprecated in practice).
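
Decoding the 9-byte header is a few shifts and a mask. A minimal sketch (frameHeader and parseFrameHeader are illustrative names, not necessarily the ones the lab's parser uses):

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// frameHeader mirrors the fixed 9-byte HTTP/2 frame header.
type frameHeader struct {
	Length   uint32 // 24-bit payload length
	Type     uint8  // 0x0 DATA, 0x1 HEADERS, 0x4 SETTINGS, 0x7 GOAWAY, ...
	Flags    uint8
	StreamID uint32 // 31 bits; the reserved high bit is masked off
}

// parseFrameHeader decodes the first 9 bytes of b.
func parseFrameHeader(b []byte) (frameHeader, error) {
	if len(b) < 9 {
		return frameHeader{}, errors.New("h2: need 9 bytes for a frame header")
	}
	return frameHeader{
		Length:   uint32(b[0])<<16 | uint32(b[1])<<8 | uint32(b[2]),
		Type:     b[3],
		Flags:    b[4],
		StreamID: binary.BigEndian.Uint32(b[5:9]) & 0x7fffffff,
	}, nil
}

func main() {
	// A HEADERS frame (type 0x1) with a 13-byte payload, flags
	// END_STREAM|END_HEADERS (0x1|0x4), on stream 1.
	raw := []byte{0x00, 0x00, 0x0d, 0x01, 0x05, 0x00, 0x00, 0x00, 0x01}
	h, err := parseFrameHeader(raw)
	fmt.Printf("%+v err=%v\n", h, err)
}
```

A real parser would next check Length against SETTINGS_MAX_FRAME_SIZE before reading the payload, so a peer cannot make it allocate arbitrarily.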

Three mechanisms make h2 different and are the interview core:

Multiplexing. Each request/response is a stream with an odd (client-initiated) ID. Frames from many streams interleave on the wire, demuxed by Stream Identifier. One connection, hundreds of concurrent requests — no h1 connection-pool churn (the gw-04 win is partly "use h2 to the origin").
HPACK header compression (RFC 7541). Headers repeat enormously (:authority, user-agent, cookies). HPACK keeps a dynamic table of previously-seen header fields on each side; a header can be sent as a 1-byte index into that table. Stateful, per-connection, and the source of subtle bugs when a proxy rewrites headers (you must keep the encoder/decoder tables consistent). The static table is 61 predefined common headers.
Flow control. Per-stream and per-connection credit windows (WINDOW_UPDATE). A receiver advertises how many bytes it'll accept; the sender must not exceed it. This is application-level backpressure layered on top of TCP's — and a misconfigured window is a classic throughput bug (a slow consumer stalls a stream, or tiny windows cap throughput).
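
The two-level credit rule in the flow-control item can be sketched as plain bookkeeping. sendWindow and reserve are hypothetical names; the sketch ignores details such as windows going negative after a SETTINGS change.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	defaultWindow = 65535     // initial window size defined by the HTTP/2 spec
	maxWindow     = 1<<31 - 1 // a window may never exceed 2^31-1
)

// sendWindow is a hypothetical credit counter for one flow-control scope
// (either the whole connection or a single stream).
type sendWindow struct{ avail int64 }

// update applies a WINDOW_UPDATE increment.
func (w *sendWindow) update(inc int64) error {
	if inc <= 0 || w.avail+inc > maxWindow {
		return errors.New("flow-control error")
	}
	w.avail += inc
	return nil
}

// reserve charges DATA bytes against both scopes and returns how many bytes
// may be sent now: min(want, connection credit, stream credit).
func reserve(conn, stream *sendWindow, want int64) int64 {
	n := want
	if conn.avail < n {
		n = conn.avail
	}
	if stream.avail < n {
		n = stream.avail
	}
	if n <= 0 {
		return 0
	}
	conn.avail -= n
	stream.avail -= n
	return n
}

func main() {
	conn := &sendWindow{avail: defaultWindow}
	stream := &sendWindow{avail: defaultWindow}

	fmt.Println(reserve(conn, stream, 100000)) // 65535: both windows drained
	fmt.Println(reserve(conn, stream, 1))      // 0: the sender must wait

	_ = stream.update(1000)                 // stream credit alone is not enough
	fmt.Println(reserve(conn, stream, 500)) // 0: connection window still empty

	_ = conn.update(1000)
	fmt.Println(reserve(conn, stream, 500)) // 500
}
```

The third call is the classic stall: the stream has credit, the connection does not, and nothing moves until the receiver tops up the connection-level window.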

The HOL-blocking subtlety: h2 fixes application-level head-of-line blocking (a slow stream no longer blocks others at the h2 layer) but NOT transport-level HOL — because it's all one TCP connection, a single lost TCP segment stalls every stream until retransmit. That residual problem is exactly what HTTP/3 solves.

gRPC — RPC on top of h2

gRPC is a convention over HTTP/2:

:method = POST
:path   = /package.Service/Method
content-type = application/grpc
DATA frames carry: [1-byte compressed-flag][4-byte length][protobuf message]...
trailers (a second HEADERS frame) carry: grpc-status, grpc-message

Key gateway facts: gRPC uses trailers for status, so an HTTP 200 can still be a gRPC error (grpc-status: 14 = UNAVAILABLE). Streaming RPCs keep the stream open for many messages. Load balancing gRPC requires L7 / per-request distribution (the multiplexing trap above).
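
The length-prefixed message framing shown above is small enough to sketch directly. The helper names are illustrative; the payload here is an arbitrary byte string rather than a real protobuf message.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// encodeGRPCMessage prefixes payload with the 5-byte gRPC message header:
// a 1-byte compressed flag and a 4-byte big-endian length.
func encodeGRPCMessage(payload []byte, compressed bool) []byte {
	out := make([]byte, 5+len(payload))
	if compressed {
		out[0] = 1
	}
	binary.BigEndian.PutUint32(out[1:5], uint32(len(payload)))
	copy(out[5:], payload)
	return out
}

// decodeGRPCMessage reads one message from the front of buf and returns the
// payload, the compressed flag, and whatever bytes follow it.
func decodeGRPCMessage(buf []byte) (payload []byte, compressed bool, rest []byte, err error) {
	if len(buf) < 5 {
		return nil, false, buf, errors.New("grpc: short length prefix")
	}
	n := int(binary.BigEndian.Uint32(buf[1:5]))
	if n < 0 || len(buf)-5 < n {
		return nil, false, buf, errors.New("grpc: incomplete message")
	}
	return buf[5 : 5+n], buf[0] == 1, buf[5+n:], nil
}

func main() {
	// Two messages back to back, as they might appear across DATA frames.
	stream := append(encodeGRPCMessage([]byte("hi"), false),
		encodeGRPCMessage([]byte("there"), false)...)

	for len(stream) > 0 {
		msg, _, rest, err := decodeGRPCMessage(stream)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("message: %q\n", msg)
		stream = rest
	}
}
```

Message boundaries are independent of DATA frame boundaries: one message can span several frames and one frame can hold several messages, which is why the decoder works on a byte buffer rather than on frames.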

HTTP/3 / QUIC — HTTP over UDP

QUIC (RFC 9000) reimplements reliable, ordered, multiplexed, encrypted streams in userspace on top of UDP, with TLS 1.3 built into the handshake. HTTP/3 (RFC 9114) is HTTP semantics over QUIC, with QPACK (RFC 9204) as the HOL-blocking-safe replacement for HPACK. What changes for a gateway:

No TCP HOL blocking — each QUIC stream has independent loss recovery, so one lost packet stalls only its own stream.
Faster setup — 1-RTT, or 0-RTT for resumed connections.
Connection migration — a QUIC connection is identified by a Connection ID, not the 4-tuple, so it survives a client changing networks (Wi-Fi→cellular) without a new handshake. Great for mobile (Netflix devices); painful for L4 LBs that hash the 4-tuple.
UDP everywhere — load balancing, NAT traversal, and DDoS posture all change; many middleboxes still mishandle UDP.

4. Core terminology

Term	Definition
L7 termination	Decoding the application protocol so you can act per-request.
Keep-alive	Reusing one connection for many sequential requests (h1) — the thing gw-04 pools.
Chunked encoding	h1 body framed as length-prefixed chunks when total length isn't known.
Request smuggling	Front/back proxies disagreeing on body framing (CL.TE/TE.CL); a security bug.
Stream (h2/h3)	One independent request/response within a multiplexed connection.
Frame	The h2 unit of transmission: 9-byte header + typed payload.
HPACK / QPACK	Header compression for h2 / h3 (QPACK avoids HOL blocking).
Flow control	Per-stream/connection credit windows; application-level backpressure.
GOAWAY	h2/h3 frame telling the peer to stop opening streams; the graceful-drain signal.
Trailers	Headers sent after the body; how gRPC conveys status.
Head-of-line blocking	A stalled item blocking those behind it: at L4 (TCP), at h1 (pipelining), at h2 (TCP under multiplexing), fixed at h3.
QUIC Connection ID	Identifies a QUIC connection independent of IP/port; enables migration.
0-RTT	Sending request data in the first flight on a resumed TLS1.3/QUIC connection.

5. Mental models

h1 is one phone line; h2 is a switchboard. With h1 keep-alive you reuse the line but must finish one call before the next. h2 runs many calls on one line at once, tagged by stream ID. h3 gives each call its own line so a dropped packet on one call doesn't mute the others.
HPACK is a shared shorthand the two parties build as they talk. The first time you say a long header you spell it out and both sides write it down; after that you just say "row 7." Lose sync on the notebook (a proxy that rewrites headers carelessly) and you start decoding garbage.
Flow control is a debit card, not unlimited credit. The receiver hands the sender a balance (the window); every DATA byte is a charge; WINDOW_UPDATE is a top-up. Forget to top up and the sender stops — a stall that looks like a network problem but is your bookkeeping.
gRPC over an L4 LB is a party line. All the multiplexed calls pin to whichever backend the single connection landed on, so one backend gets hammered while others idle. The fix is to balance the calls, not the connection — i.e., L7.
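
The "row 7" shorthand in the HPACK model is literal for the static table. The sketch below decodes only single-byte indexed fields against a handful of static-table entries (as listed in RFC 7541 Appendix A); it ignores the dynamic table, literals, and Huffman coding, and the names are illustrative.

```go
package main

import "fmt"

// staticTable holds a few entries of the HPACK static table, keyed by index.
var staticTable = map[byte][2]string{
	2: {":method", "GET"},
	3: {":method", "POST"},
	4: {":path", "/"},
	6: {":scheme", "http"},
	7: {":scheme", "https"},
	8: {":status", "200"},
}

// decodeIndexed decodes a one-byte "indexed header field": the high bit is
// set and the low 7 bits are the table index. It returns found=false for
// anything this sketch does not cover.
func decodeIndexed(b byte) (name, value string, found bool) {
	if b&0x80 == 0 {
		return "", "", false // not an indexed representation
	}
	f, inTable := staticTable[b&0x7f]
	return f[0], f[1], inTable
}

func main() {
	// Three bytes carry three complete header fields.
	for _, b := range []byte{0x82, 0x87, 0x84} {
		name, value, found := decodeIndexed(b)
		fmt.Println(name, value, found)
	}
}
```

Indexes above 61 point into the per-connection dynamic table, which is where the shared state (and the proxy consistency problem) comes in.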

6. Common misconceptions

"HTTP/2 is faster because it's binary." Binary framing helps, but the real wins are multiplexing (no h1 connection churn / no HOL at the h1 layer) and header compression. And h2 adds a problem (TCP HOL across streams) that h3 had to fix.
"HTTP/2 needs fewer connections, so any LB is fine." The opposite: because h2 uses one long-lived connection with many streams, an L4 LB pins all that load to one backend. h2/gRPC make L7 LB more necessary, not less.
"gRPC errors show up as HTTP 5xx." No — gRPC status rides in trailers; you can get HTTP 200 with grpc-status: 14. A gateway measuring only HTTP status will report a broken backend as healthy.
"HTTP/3 is just HTTP/2 over UDP." It re-implements the whole reliable-transport layer (QUIC), changes header compression (QPACK), adds connection migration, and folds TLS into the handshake. It's a new transport, not a swap of L4.
"Chunked encoding is legacy." It's how you stream a response of unknown length (Server-Sent Events, streaming JSON, gRPC-web). And its interaction with Content-Length is a live security surface (smuggling) every gateway must normalize.

7. Interview talking points

"What actually changed from HTTP/1.1 to HTTP/2 to HTTP/3?" Give the three-line answer: h2 = multiplexed binary streams + HPACK + flow control over one TCP conn (fixes h1 HOL and connection churn); h3 = the same over QUIC/UDP (fixes the residual TCP HOL, adds 0-RTT and connection migration). Name the cost each introduced.
"Why does gRPC load balancing break behind a classic LB?" h2 multiplexes all calls over one connection; an L4 (connection-level) LB pins them to one backend. You need L7/request-level LB, or a proxy that re-balances per request, or client-side LB, with the server's MAX_CONNECTION_AGE forcing periodic reconnects so clients re-resolve and spread out.
"Walk me through HTTP/2 flow control." Two levels (stream + connection), both starting at 65,535 bytes; SETTINGS_INITIAL_WINDOW_SIZE changes the initial per-stream window, and WINDOW_UPDATE tops up either level (it is the only way to grow the connection window). Only DATA frames are charged; the sender blocks at zero. It's application backpressure on top of TCP's. A too-small window caps throughput on high-BDP links; a stuck consumer stalls a stream.
"How do you drain an HTTP/2 connection?" GOAWAY with the last-processed stream ID: the peer finishes in-flight streams, opens no new ones, then closes. Tie it back to gw-01 draining — for h2 the drain signal is in-band, which is better than h1 (where you rely on Connection: close per response).
"What's request smuggling and how does a gateway prevent it?" Front and back proxies disagreeing on body length (Content-Length vs Transfer-Encoding). Defense: reject ambiguous framing, prefer one encoding, normalize before forwarding, and run identical parsing semantics front-to-back.
"Why is HPACK stateful, and why does that matter for a proxy?" The dynamic table is per-connection shared state between encoder and decoder. A proxy that mutates headers must update its own encoder/decoder consistently, and must bound the table size (SETTINGS_HEADER_TABLE_SIZE) to avoid an HPACK-bomb DoS.
"When would you push HTTP/3 to the edge?" Mobile/lossy networks (Netflix devices on cellular) where TCP HOL and reconnect cost hurt; weigh against UDP middlebox issues, higher CPU for userspace transport, and the operational cost of new LB/observability tooling.

8. Connections to other labs

gw-01 (L4) forwards the bytes this lab parses; the moment you terminate L7 you lose splice zero-copy.
gw-03 (API gateway) runs filters on the parsed request/response this lab produces — routing, auth, rewrite all assume L7 decoding.
gw-04 (connection management) uses h2 multiplexing to the origin as one way to curb connection churn; GOAWAY is the drain primitive.
gw-05 (WebSockets) is the other L7 upgrade path (Upgrade: websocket from h1, or CONNECT over h2) — a persistent connection, not a request/response.
gw-06 (resilience) depends on per-request boundaries (retries, hedging) that only L7 gives you, and must understand gRPC trailers to classify failures correctly.
gw-08 (Envoy/xDS) — Envoy's HTTP connection manager is a production version of this lab's parser + filter chain.

gw-02 — The Hitchhiker's Guide to L7 Protocols

Companion to CONCEPTS.md, with the runnable HTTP/2 toolkit in src/go/h2/. You will decode the same bytes nghttp -v shows — but from first principles, so you can debug an h2 problem nobody else on the team can.

L7 is where a proxy stops being a dumb pipe (gw-01) and starts understanding traffic: routing per request, retrying, rewriting, observing. The price is that you must parse the protocol, and HTTP/2 is where most engineers' mental model is fuzziest. This guide makes it concrete by building the three mechanisms that make h2 different — multiplexed binary framing, HPACK header compression, and flow control — and then connecting them to the real-world traps: the gRPC load-balancing pitfall and HTTP/3.

1. The frame layer (frame.go)

Everything in h2 is a frame: a 9-byte header + payload.

+-----------------------------------------------+
| Length (24)                                   |
+---------------+---------------+---------------+
| Type (8)      | Flags (8)     |
+-+-------------+---------------+---------------+
|R| Stream Identifier (31)                      |
+=+=============================================+
| Frame Payload (Length bytes)                ...

ReadFrame is a faithful decode: read 9 bytes, pull the 24-bit length, the type, the flags, and the 31-bit stream id (masking the reserved high bit — forgetting that mask is a classic bug that yields astronomical stream IDs). WriteFrame is the inverse, so the package round-trips (see TestFrameRoundTrip).
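As a rough illustration of the reserved-bit mask described above, here is a minimal, self-contained sketch of writing and reading the 9-byte header. The names putFrameHeader and streamIDOf are hypothetical and are not the lab's ReadFrame/WriteFrame; only the byte layout is taken from the diagram.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// putFrameHeader fills the 9-byte HTTP/2 frame header: 24-bit length,
// 8-bit type, 8-bit flags, then the reserved bit plus 31-bit stream id.
// (Hypothetical helper for illustration, not the lab's WriteFrame.)
func putFrameHeader(dst *[9]byte, length uint32, typ, flags uint8, streamID uint32) {
	dst[0] = byte(length >> 16)
	dst[1] = byte(length >> 8)
	dst[2] = byte(length)
	dst[3] = typ
	dst[4] = flags
	binary.BigEndian.PutUint32(dst[5:], streamID&0x7fffffff) // reserved bit sent as 0
}

// streamIDOf reads the stream id and masks off the reserved high bit.
func streamIDOf(hdr *[9]byte) uint32 {
	return binary.BigEndian.Uint32(hdr[5:]) & 0x7fffffff
}

func main() {
	var hdr [9]byte
	putFrameHeader(&hdr, 5, 0x0, 0x1, 1) // DATA, END_STREAM, stream 1
	fmt.Printf("% x\n", hdr[:])          // 00 00 05 00 01 00 00 00 01

	hdr[5] |= 0x80 // simulate a peer that sets the reserved bit
	fmt.Println("masked:", streamIDOf(&hdr))
	fmt.Println("unmasked:", binary.BigEndian.Uint32(hdr[5:])) // 2147483649
}
```

The unmasked read is the "astronomical stream ID" bug in miniature: one stray bit turns stream 1 into stream 2147483649.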

The teachable property is multiplexing: frames for many streams interleave on one connection, demultiplexed by StreamID. TestReadFramesMultiplex writes HEADERS for stream 1, HEADERS for stream 3, then DATA for each — interleaved — and reads them back in order. One TCP connection, two concurrent requests. That is the entire reason h2 killed the h1 connection-pool churn problem (and why gw-04's churn work leans on h2 to the origin).

Frame types to know by heart: HEADERS (start a request/response), DATA (body), SETTINGS (startup config), WINDOW_UPDATE (flow-control credit), RST_STREAM (cancel one stream, and the vector for the "Rapid Reset" DoS, CVE-2023-44487), GOAWAY (graceful connection drain: the h2 equivalent of gw-01's connection drain, but in-band), and CONTINUATION (header block spillover, and the vector for the related CONTINUATION-flood DoS class).
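To make one of these payloads concrete, the sketch below decodes a GOAWAY payload as laid out in RFC 9113 §6.8 (31-bit last stream id, 32-bit error code, optional debug data). The type and function names are illustrative assumptions, not part of the lab's package.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// goAway mirrors the GOAWAY payload layout (hypothetical type for illustration).
type goAway struct {
	LastStreamID uint32 // highest stream id the sender may have processed
	ErrorCode    uint32 // 0 = NO_ERROR, i.e. a graceful drain
	DebugData    []byte // opaque, optional
}

// parseGoAway decodes a GOAWAY frame payload.
func parseGoAway(payload []byte) (goAway, error) {
	if len(payload) < 8 {
		return goAway{}, errors.New("GOAWAY payload shorter than 8 bytes")
	}
	return goAway{
		LastStreamID: binary.BigEndian.Uint32(payload[0:4]) & 0x7fffffff,
		ErrorCode:    binary.BigEndian.Uint32(payload[4:8]),
		DebugData:    payload[8:],
	}, nil
}

// notProcessed reports whether a stream was opened above the last-processed
// id, which means the peer never acted on it and it can be retried elsewhere.
func notProcessed(g goAway, streamID uint32) bool {
	return streamID > g.LastStreamID
}

func main() {
	payload := []byte{0, 0, 0, 5, 0, 0, 0, 0, 'b', 'y', 'e'}
	g, err := parseGoAway(payload)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("last=%d code=%d debug=%q\n", g.LastStreamID, g.ErrorCode, g.DebugData)
	fmt.Println("stream 7 retryable:", notProcessed(g, 7))
}
```

The last stream id is what makes the drain safe: anything above it was never processed, so a gateway can replay it on a fresh connection without risking a double execution.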

MaxFrameLen guards the 24-bit length so a hostile peer can't make you allocate gigabytes — the kind of defensive detail a maintainer adds after the first fuzzing report.

2. HPACK (hpack.go) — the stateful one

Headers repeat enormously (:authority, user-agent, cookies on every request). HPACK compresses them with two tables:

a static table of 61 common header fields (RFC 7541 Appendix A), fully reproduced in staticTable;
a per-connection dynamic table that both sides build as they talk: the first time you send a header it's spelled out and remembered; after that it's a one-byte index.

Integer coding (the part everyone gets subtly wrong)

HPACK integers use an N-bit prefix followed by continuation bytes (RFC 7541 §5.1). EncodeInt/decodeInt implement it; TestEncodeDecodeInt checks the RFC 7541 §C.1 vectors exactly: 10 in a 5-bit prefix is 0x0a; 1337 in a 5-bit prefix is 0x1f 0x9a 0x0a; 42 in an 8-bit prefix is 0x2a. The continuation-byte loop has an overflow guard, because an unbounded integer is a DoS.
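The prefix-plus-continuation scheme is easier to trust once you have traced it. The following is a minimal sketch of the RFC 7541 §5.1 algorithm, not the lab's EncodeInt/decodeInt; appendInt and readInt are hypothetical names, and the overflow bound is a simple choice made for this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// appendInt appends v encoded with an n-bit prefix (1 <= n <= 8).
// The upper 8-n bits of the first byte are left zero for the caller's flags.
func appendInt(dst []byte, n uint8, v uint64) []byte {
	limit := uint64(1)<<n - 1
	if v < limit {
		return append(dst, byte(v)) // fits entirely in the prefix
	}
	dst = append(dst, byte(limit)) // prefix saturated: continuation follows
	v -= limit
	for v >= 128 {
		dst = append(dst, byte(v&0x7f)|0x80) // low 7 bits, "more" bit set
		v >>= 7
	}
	return append(dst, byte(v))
}

// readInt decodes an n-bit-prefix integer and returns the unread remainder.
func readInt(src []byte, n uint8) (uint64, []byte, error) {
	if len(src) == 0 {
		return 0, nil, errors.New("hpack: empty input")
	}
	limit := uint64(1)<<n - 1
	v := uint64(src[0]) & limit
	src = src[1:]
	if v < limit {
		return v, src, nil
	}
	var shift uint
	for i, b := range src {
		if shift > 56 { // overflow guard: refuse unbounded continuation
			return 0, nil, errors.New("hpack: integer overflow")
		}
		v += uint64(b&0x7f) << shift
		shift += 7
		if b&0x80 == 0 {
			return v, src[i+1:], nil
		}
	}
	return 0, nil, errors.New("hpack: truncated integer")
}

func main() {
	fmt.Printf("% x\n", appendInt(nil, 5, 10))   // 0a
	fmt.Printf("% x\n", appendInt(nil, 5, 1337)) // 1f 9a 0a
	fmt.Printf("% x\n", appendInt(nil, 8, 42))   // 2a
	v, _, err := readInt([]byte{0x1f, 0x9a, 0x0a}, 5)
	fmt.Println(v, err)
}
```

Tracing 1337 by hand: the 5-bit prefix saturates at 31, leaving 1306; its low 7 bits are 26 (0x1a, sent as 0x9a with the continuation bit) and the remainder is 10 (0x0a).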

The dynamic table, proven against the RFC

TestHpackRFCRequestExamples decodes the actual RFC 7541 §C.3.1 and §C.3.2 request blocks with one Decoder instance. The second request uses byte 0xbe — an index into the dynamic table — to refer to the :authority: www.example.com that the first request added. If the decoder didn't persist its table across HEADERS frames, this would decode to garbage. That is the single most important HPACK fact for a proxy engineer:

The dynamic table is per-connection shared state between the encoder and decoder. If your gateway rewrites a header (adds x-forwarded-for, strips a hop-by-hop header), it must re-encode with its OWN encoder whose table tracks what it actually sent to the origin — never blindly forward the client's HPACK block. Desync the table and every subsequent header on that connection corrupts.

TestHpackDynamicEviction proves the size accounting and eviction (each entry costs len(name)+len(value)+32 bytes; over maxSize the oldest entries are evicted). Bounding maxSize is your defense against an HPACK-table-bloat DoS.

Huffman, honestly

Real captures contain Huffman-coded header strings. This lab decoder handles integers, literals, and the dynamic table (everything that teaches the mechanics) and surfaces Huffman explicitly (ErrHuffman, TestHpackHuffmanSurfaced) rather than guessing. For production header parsing, use golang.org/x/net/http2/hpack, which includes the 257-symbol Huffman table. The lesson here is the table mechanics; the Huffman codec is a (large, mechanical) lookup you should know exists, not re-implement under interview pressure.

3. Flow control (flow.go) — backpressure, again

h2 adds application-level flow control on top of TCP's: two credit windows, one per stream and one per connection. A sender may transmit DATA only while both are positive; each byte debits both; a WINDOW_UPDATE credits one.

TestFlowControl walks the subtle part: drain a stream to zero and a WINDOW_UPDATE on the stream alone is not enough if the connection window is also zero; you must credit both. That two-level model is why a single greedy stream can't starve the others, and why a misconfigured SETTINGS_INITIAL_WINDOW_SIZE is a throughput bug on high-BDP (bandwidth × delay) links: the window caps in-flight bytes, and keeping the pipe full needs in-flight bytes ≈ bandwidth × RTT, so throughput tops out at roughly window / RTT and a too-small window throttles a fast, far link.

This is the same backpressure principle as gw-01's coupled copy, but expressed as explicit credits in the protocol instead of a blocking Write. Recognizing that "flow control" at L4 (TCP receive window), L7 (h2 WINDOW_UPDATE), and in your proxy's buffers (gw-01) are the same idea at three layers is a senior insight.

4. The demux (demux.go) — multiplexing made visible

Demux reconstructs per-stream state from an interleaved frame stream: stream phase (idle → open → half-closed → closed), header count, data bytes, and flow-control debits. TestDemuxMultiplexing feeds HEADERS+DATA for streams 1 and 3 and asserts ActiveStreams() == [1, 3] (two requests concurrently in flight on one connection) and that an END_STREAM flag moves a stream to half-closed.

The CLI h2inspect wraps this: pipe it a raw h2 byte stream and it prints a frame trace + decoded headers + the multiplexed stream set, the same view as nghttp -v but from your own parser.
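One invariant a demultiplexer of this kind would typically enforce before creating per-stream state is the stream-ID rule from RFC 9113 §5.1.1: client-initiated IDs are odd and strictly increasing. A minimal sketch, with a hypothetical streamIDChecker type that is not part of the lab's Demux:

```go
package main

import (
	"errors"
	"fmt"
)

// streamIDChecker remembers the highest client-initiated stream id seen.
type streamIDChecker struct{ lastClient uint32 }

// openClientStream validates a new client-initiated stream id. In a real
// codec either failure is a connection error of type PROTOCOL_ERROR.
func (c *streamIDChecker) openClientStream(id uint32) error {
	if id%2 == 0 { // also rejects 0, which names the connection itself
		return errors.New("client-initiated stream id must be odd")
	}
	if id <= c.lastClient {
		return errors.New("stream id must exceed every id opened so far")
	}
	c.lastClient = id
	return nil
}

func main() {
	var c streamIDChecker
	for _, id := range []uint32{1, 3, 3, 2, 5} {
		fmt.Println(id, c.openClientStream(id))
	}
}
```

Because IDs never repeat, a long-lived connection eventually runs out of them; that exhaustion is one more reason connections get recycled with GOAWAY.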

5. The gRPC load-balancing trap (the interview favorite)

gRPC is just HTTP/2 with conventions: :path = /package.Service/Method, content-type: application/grpc, length-prefixed protobuf in DATA frames, and — critically — status in trailers (a second HEADERS frame after the body carrying grpc-status). So HTTP 200 + grpc-status 14 is a failed call. A gateway that classifies success by :status alone reports a broken backend as healthy (this matters for gw-06 retries and gw-11 SLOs).
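The classification rule fits in one function. This is a simplified sketch: callOK is a hypothetical helper, headers are modeled as a plain map, and treating a missing grpc-status as a failure is an assumption made here for safety.

```go
package main

import "fmt"

// callOK reports whether a gRPC call succeeded. Both conditions must hold:
// the HTTP layer said 200 AND the trailers carry grpc-status 0.
// For a trailers-only response, pass the single HEADERS block as trailers.
func callOK(httpStatus string, trailers map[string]string) bool {
	if httpStatus != "200" {
		return false
	}
	code, present := trailers["grpc-status"]
	return present && code == "0"
}

func main() {
	fmt.Println(callOK("200", map[string]string{"grpc-status": "0"}))  // true
	fmt.Println(callOK("200", map[string]string{"grpc-status": "14"})) // false: UNAVAILABLE
	fmt.Println(callOK("200", map[string]string{}))                    // false: no status at all
}
```

A metrics pipeline keyed only on the first argument would count all three of these calls as successes.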

The trap: h2 multiplexes all calls over one long-lived connection, so a connection-level (L4, gw-01) load balancer pins every call to one backend. You need L7/per-request balancing (gw-06), or client-side LB with one subchannel per backend, or MAX_CONNECTION_AGE to force periodic reconnects — which reintroduces exactly the connection churn gw-04 fights. steps/03 reproduces the skew. Be ready to draw this on a whiteboard.
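Before reproducing the skew against real backends, it can help to see the arithmetic in a toy model. The simulation below is an idealized sketch (round-robin, no failures, hypothetical function names) and says nothing about any particular balancer's behavior.

```go
package main

import "fmt"

// perConnection models a connection-level (L4) balancer: the backend is
// chosen once per connection, and every call on it lands there.
func perConnection(conns, callsPerConn, backends int) []int {
	hits := make([]int, backends)
	for c := 0; c < conns; c++ {
		hits[c%backends] += callsPerConn
	}
	return hits
}

// perRequest models a request-level (L7) balancer: each call is balanced
// on its own, regardless of which connection carried it.
func perRequest(conns, callsPerConn, backends int) []int {
	hits := make([]int, backends)
	for i := 0; i < conns*callsPerConn; i++ {
		hits[i%backends]++
	}
	return hits
}

func main() {
	// One gRPC channel (one h2 connection), 1000 calls, 3 backends.
	fmt.Println("L4:", perConnection(1, 1000, 3)) // L4: [1000 0 0]
	fmt.Println("L7:", perRequest(1, 1000, 3))    // L7: [334 333 333]
}
```

Raising conns in the L4 model evens things out only as fast as new connections appear, which is the intuition behind MAX_CONNECTION_AGE and its churn cost.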

6. HTTP/3 in one breath

h2 fixed h1's application-level head-of-line blocking but not the transport-level kind: it's all one TCP connection, so one lost segment stalls every stream until retransmit. HTTP/3 moves the transport into userspace over UDP (QUIC): independent per-stream loss recovery (no TCP HOL), TLS 1.3 folded into the handshake (1-RTT, 0-RTT resume), and connection migration (a QUIC Connection ID, not the 4-tuple, identifies the connection — so a phone switching Wi-Fi→cellular keeps its connection). For the edge this is huge for mobile (Netflix devices) and painful for L4 LBs that hash the 4-tuple. Header compression becomes QPACK (HOL-blocking-safe). You won't implement QUIC here, but you must be able to say why it exists: to kill the residual TCP HOL that h2's single connection couldn't.

7. Hands-on

cd src/go
bash ../scripts/verify.sh   # or: go test -race ./...

# Build a header block in code and watch it decode (a tiny program, or
# adapt TestHpackEncodeDecodeRoundTrip). Then inspect a real stream:
go build -o /tmp/h2inspect ./cmd/h2inspect

# Generate a cleartext h2c capture with curl/nghttp and pipe it in:
#   nghttp -v http://127.0.0.1:9000/ -m 4   # http:// = cleartext; multiplex 4 reqs
# (real headers are Huffman-coded; h2inspect prints frames + marks
#  [huffman] for header blocks — the framing/demux still works.)

8. Exercises

Add Huffman decode for the static-table-only case using golang.org/x/net/http2/hpack behind a build tag, and feed h2inspect a real curl --http2-prior-knowledge capture. Confirm your frame trace matches nghttp -v.
Reproduce the gRPC LB skew (steps/03): one channel, 1000 calls, count per-backend hits through an L4 LB vs an L7 path. Quantify it.
Make a too-small window throttle throughput: lower the initial window in FlowController, simulate a high-RTT link by delaying WINDOW_UPDATEs, and show throughput ≈ window/RTT.
Detect a trailers-only failure: extend Demux to capture the second (trailer) HEADERS frame and surface grpc-status; prove a 200-but-failed call is detectable. This is what gw-11 needs.
Implement GOAWAY draining: track the last-processed stream id, reject new streams after a GOAWAY, and let in-flight finish — the h2 analog of gw-01's connection drain.

gw-02 — References

Protocol specs (read the framing sections in full)

RFC 9110 — HTTP Semantics. RFC 9112 — HTTP/1.1 syntax (the message-framing/Content-Length/chunked rules).
RFC 9113 — HTTP/2. Read §4 (frames), §5 (streams/multiplexing), §6 (frame types), §6.9 (flow control), §6.8 (GOAWAY).
RFC 7541 — HPACK. The static + dynamic table model.
RFC 9114 — HTTP/3. RFC 9000 — QUIC transport. RFC 9204 — QPACK. RFC 9001 — QUIC + TLS 1.3.
gRPC over HTTP/2 — the wire spec. https://github.com/grpc/grpc/blob/master/doc/PROTOCOL-HTTP2.md

Implementations to read alongside

golang.org/x/net/http2 (the package that Go's stdlib net/http bundles for its HTTP/2 support): a clean, readable h2 implementation. Read frame.go, hpack/, and the flow-control accounting. This lab's parser is a teaching subset of it.
nghttp2 (C): the reference h2 library (nghttp3 is its h3 companion); nghttp/h2load tools.
Envoy's http_connection_manager and codecs: production L7 parsing + filter chain (gw-08).
quic-go (Go) and Cloudflare quiche (Rust) — readable QUIC/h3.

Background & war stories

PortSwigger, HTTP request smuggling — the canonical CL.TE/TE.CL writeups. https://portswigger.net/web-security/request-smuggling
Cloudflare / Fastly blogs on HTTP/2 and HTTP/3 at the edge, 0-RTT replay risk, and QUIC connection migration.
High Performance Browser Networking (Ilya Grigorik) — free online; the best single explanation of h1/h2/QUIC tradeoffs. https://hpbn.co/
gRPC blog, gRPC Load Balancing — why L4 LB breaks h2/gRPC and the options (look-aside LB, proxyless, MAX_CONNECTION_AGE).

Tooling

curl --http2 -v, curl --http3 -v — see the protocol negotiation.
nghttp -v, h2load — h2 client + load tester with frame traces.
grpcurl — call gRPC services from the shell.
Wireshark with TLS keylog (SSLKEYLOGFILE) — decode h2/h3 frames.

Cross-lab dependencies

Upstream: gw-01 (the byte stream this lab terminates).
Downstream: gw-03 (filters over parsed requests), gw-04 (h2 to origin), gw-05 (the WebSocket upgrade path), gw-06 (per-request retries/hedging + gRPC status classification), gw-08 (Envoy's codec).

gw-02 — Analysis

The design questions behind terminating L7, and what to be ready to defend in a review.

Required behaviors of an L7 terminator

Unambiguous framing. Exactly one body-delimitation rule per message. Reject messages that specify both Content-Length and Transfer-Encoding: chunked (smuggling surface), or normalize to one before forwarding.
Correct h2 demux. Frames from interleaved streams must be routed to the right per-stream state machine by Stream Identifier; stream IDs are monotonic and parity-typed (odd=client, even=server).
Flow-control accounting. Track per-stream and per-connection windows; never send DATA beyond the peer's advertised window; emit WINDOW_UPDATE as you consume.
HPACK consistency. Encoder and decoder dynamic tables must stay in lockstep across header rewrites; bound the table size.
Graceful close. Use GOAWAY (h2) to drain rather than abruptly closing the TCP connection mid-stream.

Design decisions in the lab parser

Parse, don't fully implement. The steps/ build a read-only h2 frame parser: it decodes the 9-byte header, classifies frame types, and tracks per-stream state — enough to reason about multiplexing, HPACK, and flow control, without re-implementing a production codec. The real thing is golang.org/x/net/http2; the lab's job is to make its mechanics legible.
Static-table HPACK first, dynamic table second. We decode using the 61-entry static table and literal headers before adding the dynamic table, because the dynamic table is where the subtle state-sync bugs live and you want to see the table grow.
Surface the multiplexing trap experimentally. A step runs many concurrent gRPC calls over one h2 connection through an L4 vs an L7 balancer and shows the load distribution differ — turning the interview talking point into a reproduced result.

Tradeoffs worth flagging

L7 termination forfeits zero-copy. The moment you parse, you copy through userspace and hold per-message state. That's the price of routing/retries/auth. For pure pass-through (e.g., opaque TLS passthrough by SNI) you stay L4 and keep splice (gw-07).
h2 to the origin is a double-edged sword. One multiplexed connection per origin curbs connection churn (gw-04) and removes h1 HOL — but concentrates load and re-introduces TCP HOL; a single packet loss stalls all streams to that origin. You trade connection efficiency for blast-radius-per-connection.
HTTP/3's UDP changes your whole operational model. New LB hashing (connection ID, not 4-tuple), new DDoS posture, new observability (no TCP counters), userspace CPU cost. Often worth it at the client edge (mobile), rarely worth it origin-side first.
Trailers complicate everything. gRPC status in trailers means your access logs, metrics, and retry logic must read past the body. A gateway that classifies success by HTTP status alone will misreport gRPC health (gw-06, gw-11).

What production adds beyond this lab

A full, fuzzed codec (h2 frame parsers are a rich attack surface: HPACK bombs, CONTINUATION floods, RST_STREAM floods → the "Rapid Reset" CVE-2023-44487 class of DoS).
Protocol downgrade/upgrade bridging (h2 client ↔ h1 origin, h3 client ↔ h2 origin) with correct semantics for trailers, 1xx, and Expect/Continue.
0-RTT replay protection for h3 (non-idempotent requests must not be accepted in early data).
Per-route protocol policy (force h2 to this cluster, h1 to that one) driven by the control plane (gw-08).
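The 0-RTT rule above can be sketched as a small policy function. This assumes the gateway already knows whether the request arrived in TLS early data (from its TLS stack, or from an upstream hop's Early-Data: 1 header per RFC 8470); earlyDataStatus is a hypothetical name, and many deployments choose a stricter allow-list than "idempotent".

```go
package main

import (
	"fmt"
	"net/http"
)

// idempotent lists the methods RFC 9110 §9.2.2 defines as idempotent.
var idempotent = map[string]bool{
	http.MethodGet: true, http.MethodHead: true, http.MethodOptions: true,
	http.MethodTrace: true, http.MethodPut: true, http.MethodDelete: true,
}

// earlyDataStatus returns 0 when the request may proceed, or 425 (Too Early)
// when a non-idempotent request arrived in replayable early data. The client
// is expected to retry once the handshake has completed.
func earlyDataStatus(method string, inEarlyData bool) int {
	if inEarlyData && !idempotent[method] {
		return http.StatusTooEarly
	}
	return 0
}

func main() {
	fmt.Println(earlyDataStatus(http.MethodPost, true))  // 425
	fmt.Println(earlyDataStatus(http.MethodGet, true))   // 0
	fmt.Println(earlyDataStatus(http.MethodPost, false)) // 0
}
```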

gw-02 — Execution

Prerequisites

Go ≥ 1.25 (stdlib only, offline).
Optional: nghttp/curl --http2 to generate real captures.

One-shot

cd gw-02-l7-protocols
bash scripts/verify.sh        # vet + build + test -race  → "=== gw-02 OK ==="

Per-language workflow (Go)

cd gw-02-l7-protocols/src/go
go test -race -count=1 ./...      # 10 tests in package h2
go build -o /tmp/h2inspect ./cmd/h2inspect

Use the frame inspector

# Pipe any raw HTTP/2 byte stream (h2c / prior-knowledge) into it:
/tmp/h2inspect -f capture.bin
# or:  some-h2-source | /tmp/h2inspect

It prints a frame-by-frame trace, decodes non-Huffman HEADERS, marks Huffman header blocks, and lists the multiplexed stream set.

What's in the package

File	What
`h2/frame.go`	frame read/write, types, flags, multiplex-friendly parsing
`h2/hpack.go`	HPACK integer/string coding, static+dynamic table, encoder
`h2/flow.go`	two-level (stream+connection) flow-control accounting
`h2/demux.go`	per-stream state machine = multiplexing made visible
`cmd/h2inspect`	CLI frame/header inspector

See GUIDE.md for the full walkthrough and the gRPC LB-trap and HTTP/3 deep dives.

gw-02 — Verification

One command

cd gw-02-l7-protocols && bash scripts/verify.sh

What the tests prove

Test	Invariant
`TestEncodeDecodeInt`	HPACK integer coding matches RFC 7541 §C.1 vectors exactly (10/5-bit, 1337/5-bit, 42/8-bit)
`TestFrameRoundTrip`	frame write→read preserves type/flags/stream/payload; END_STREAM decoded
`TestReadFramesMultiplex`	interleaved frames for streams 1 & 3 parse in order (multiplexing)
`TestHpackStaticIndexed`	static-table indexed field decodes (`0x82` → `:method GET`)
`TestHpackRFCRequestExamples`	RFC 7541 §C.3.1 + §C.3.2 decode correctly with a persistent dynamic table (the `0xbe` dynamic reference)
`TestHpackEncodeDecodeRoundTrip`	encoder→decoder round-trips; a literal becomes dynamically indexable
`TestHpackDynamicEviction`	dynamic-table size accounting + eviction (`len+len+32`, evict oldest over max)
`TestHpackHuffmanSurfaced`	Huffman strings surface `ErrHuffman` (no silent garbage)
`TestFlowControl`	two-level windows: stream-only `WINDOW_UPDATE` is insufficient while the conn window is 0
`TestDemuxMultiplexing`	per-stream phases + concurrent stream set reconstructed from an interleaved trace

All under -race.

What "green" does NOT guarantee

No Huffman codec. Real captures Huffman-code headers; this decoder surfaces that and defers to x/net/http2/hpack (GUIDE §2). The framing, demux, and flow control work regardless.
Not a full codec. No CONTINUATION reassembly, PRIORITY trees, or SETTINGS negotiation state machine — out of scope; production = Envoy's HCM / x/net/http2.
No live gRPC LB demo in-process. The skew is reproduced as a shell exercise (steps/03) against real backends.

gw-02 step 01 — Parse the HTTP/2 frame layer

Goal

Decode an HTTP/2 connection at the frame level: the connection preface, the 9-byte frame header, and the common frame types. This makes multiplexing visible — you'll watch frames from multiple streams interleave on one connection.

Background — the preface and frame header

Every h2 connection starts with a fixed client preface then a SETTINGS frame both ways:

client preface (24 bytes): "PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n"
then: SETTINGS frame (client) and SETTINGS frame (server), each ACKed

Frame header (9 bytes), big-endian:

length:  uint24   (payload length, not incl. the 9-byte header)
type:    uint8    (0=DATA 1=HEADERS 3=RST_STREAM 4=SETTINGS
                   6=PING 7=GOAWAY 8=WINDOW_UPDATE ...)
flags:   uint8
stream:  uint31   (high bit reserved; 0 = connection-level)

Code — `src/go/frame.go`

package h2

import (
	"encoding/binary"
	"fmt"
	"io"
)

type FrameType uint8

const (
	FrameData         FrameType = 0x0
	FrameHeaders      FrameType = 0x1
	FramePriority     FrameType = 0x2
	FrameRSTStream    FrameType = 0x3
	FrameSettings     FrameType = 0x4
	FramePushPromise  FrameType = 0x5
	FramePing         FrameType = 0x6
	FrameGoAway       FrameType = 0x7
	FrameWindowUpdate FrameType = 0x8
	FrameContinuation FrameType = 0x9
)

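// frameNames is indexed by FrameType for the ten types defined in RFC 9113.
var frameNames = [...]string{"DATA", "HEADERS", "PRIORITY", "RST_STREAM",
	"SETTINGS", "PUSH_PROMISE", "PING", "GOAWAY", "WINDOW_UPDATE",
	"CONTINUATION"}

func (t FrameType) String() string {
	if int(t) < len(frameNames) {
		return frameNames[t]
	}
	// Unknown types are legal on the wire and must be ignored, so never panic.
	return fmt.Sprintf("UNKNOWN(0x%02x)", uint8(t))
}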

// Flag bits (meaning depends on frame type).
const (
	FlagEndStream  uint8 = 0x1 // DATA, HEADERS
	FlagAck        uint8 = 0x1 // SETTINGS, PING
	FlagEndHeaders uint8 = 0x4 // HEADERS, CONTINUATION
	FlagPadded     uint8 = 0x8
)

type FrameHeader struct {
	Length   uint32 // 24-bit
	Type     FrameType
	Flags    uint8
	StreamID uint32 // 31-bit
}

type Frame struct {
	Header  FrameHeader
	Payload []byte
}

const ClientPreface = "PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n"

// ReadFrame reads one frame (9-byte header + payload) from r.
func ReadFrame(r io.Reader) (*Frame, error) {
	var hdr [9]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return nil, err
	}
	length := uint32(hdr[0])<<16 | uint32(hdr[1])<<8 | uint32(hdr[2])
	fh := FrameHeader{
		Length:   length,
		Type:     FrameType(hdr[3]),
		Flags:    hdr[4],
		StreamID: binary.BigEndian.Uint32(hdr[5:9]) & 0x7fffffff, // mask reserved bit
	}
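	payload := make([]byte, length)
	if _, err := io.ReadFull(r, payload); err != nil {
		if err == io.EOF {
			// The header promised a payload, so EOF here means a truncated frame.
			err = io.ErrUnexpectedEOF
		}
		return nil, err
	}
	return &Frame{Header: fh, Payload: payload}, nil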
}

func (f *Frame) String() string {
	end := ""
	if f.Header.Type == FrameData || f.Header.Type == FrameHeaders {
		if f.Header.Flags&FlagEndStream != 0 {
			end = " END_STREAM"
		}
	}
	return fmt.Sprintf("stream=%-3d %-13s len=%-5d flags=0x%02x%s",
		f.Header.StreamID, f.Header.Type, f.Header.Length, f.Header.Flags, end)
}

A "frame sniffer" main — watch multiplexing happen

src/go/cmd/h2sniff/main.go reads a captured h2 byte stream (e.g. from a file you dump, or proxy a real curl --http2-prior-knowledge through an io.TeeReader) and prints frames:

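// Imports needed: bufio, fmt, io, os, and the lab's h2 package.
func main() {
	r := bufio.NewReader(os.Stdin)
	pre := make([]byte, len(h2.ClientPreface))
	// Consume the connection preface; without it this is not a client h2 stream.
	if _, err := io.ReadFull(r, pre); err != nil || string(pre) != h2.ClientPreface {
		fmt.Fprintln(os.Stderr, "missing or truncated client preface")
		os.Exit(1)
	}
	for {
		fr, err := h2.ReadFrame(r)
		if err == io.EOF {
			return // clean end of capture
		}
		if err != nil {
			fmt.Fprintln(os.Stderr, "read frame:", err)
			os.Exit(1)
		}
		fmt.Println(fr)
	}
}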

Generate a capture and feed it in:

# Cleartext h2 (prior knowledge) so you can read frames without TLS:
nghttpd --no-tls 9000 &                       # or any h2c origin
nghttp -v http://127.0.0.1:9000/ -m 5         # -m 5: 5 concurrent reqs
# nghttp -v already prints frames; the lab's parser reproduces that view
# from raw bytes so you understand the wire, not just the tool's output.

Tasks

Implement ReadFrame and the type/flag decoding.
Feed it an h2 capture with multiple concurrent requests (nghttp -m 8). In the output, find frames with different stream= values interleaved — that's multiplexing you can point at.
Identify the SETTINGS exchange at the start and the END_STREAM flag that closes each request's stream.

Acceptance

Your sniffer prints a frame-by-frame trace matching nghttp -v for the same connection (same stream IDs, types, END_STREAM flags).
You can point to interleaved stream IDs and explain that one TCP connection is carrying N concurrent requests.

Discussion prompts

Why are client-initiated stream IDs always odd? What are even IDs for?
A HEADERS frame without END_HEADERS must be followed by CONTINUATION frames — why does that exist, and why is a flood of empty CONTINUATION frames a DoS (the "Rapid Reset" cousin)?
Where, exactly, would an L4 LB have to stop to make per-request routing possible? (It can't — it only sees TCP segments, not frames.)

gw-02 step 02 — HPACK decoding and flow-control accounting

Goal

Decode HPACK-compressed headers using the static table, then track HTTP/2 flow-control windows so you can explain (and debug) stalls. These are the two stateful mechanisms that trip up gateway engineers.

Part A — HPACK static-table decoding

HPACK encodes each header field as one of: an indexed field (a single index into the static+dynamic table), or a literal field (an index for the name + a string for the value, with optional Huffman coding). The 61-entry static table starts:

 1  :authority
 2  :method GET
 3  :method POST
 4  :path /
 5  :path /index.html
 6  :scheme http
 7  :scheme https
 8  :status 200
 ...
61  www-authenticate
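To see what "an indexed field" means at the byte level, here is a minimal sketch that decodes a single-byte indexed representation against the first eight static entries only. The names are hypothetical, the table is deliberately truncated, and indices that need continuation bytes or point into the dynamic table are out of scope.

```go
package main

import "fmt"

type headerField struct{ Name, Value string }

// staticPrefix holds static-table entries 1..8 (RFC 7541 Appendix A).
// The full table has 61 entries; this truncation is for illustration only.
var staticPrefix = []headerField{
	{":authority", ""}, {":method", "GET"}, {":method", "POST"}, {":path", "/"},
	{":path", "/index.html"}, {":scheme", "http"}, {":scheme", "https"}, {":status", "200"},
}

// decodeIndexed handles the "indexed header field" representation:
// high bit set, low 7 bits are the table index.
func decodeIndexed(b byte) (headerField, bool) {
	if b&0x80 == 0 {
		return headerField{}, false // some literal representation instead
	}
	idx := int(b & 0x7f)
	if idx < 1 || idx > len(staticPrefix) {
		return headerField{}, false // index 0 is invalid; others not modeled here
	}
	return staticPrefix[idx-1], true
}

func main() {
	// The first three bytes of the RFC 7541 §C.3.1 request block.
	for _, b := range []byte{0x82, 0x86, 0x84} {
		hf, ok := decodeIndexed(b)
		fmt.Printf("0x%02x -> %s %s (%v)\n", b, hf.Name, hf.Value, ok)
	}
}
```

Three bytes carry :method GET, :scheme http, and :path /, which is the whole appeal of the static table.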

Use the golang.org/x/net/http2/hpack decoder rather than reimplementing Huffman; the point is to see the table indices, not to rebuild the codec:

package h2

import "golang.org/x/net/http2/hpack"

// DecodeHeaders decodes a HEADERS frame's HPACK block into key/value
// pairs. dec must persist across frames on the same connection so the
// dynamic table stays in sync (this is the subtle part).
func DecodeHeaders(dec *hpack.Decoder, block []byte) ([]hpack.HeaderField, error) {
	var out []hpack.HeaderField
	dec.SetEmitFunc(func(hf hpack.HeaderField) { out = append(out, hf) })
	if _, err := dec.Write(block); err != nil {
		return nil, err
	}
	return out, dec.Close()
}

// NewDecoder makes a decoder with a bounded dynamic table — bounding it
// is your defense against an HPACK-table-bloat DoS.
func NewDecoder(maxTable uint32) *hpack.Decoder {
	return hpack.NewDecoder(maxTable, nil)
}

The proxy gotcha: if your gateway rewrites a header (adds x-forwarded-for, strips a hop-by-hop header), it must re-encode with its own encoder whose dynamic table tracks what it actually sent to the origin — not blindly forward the client's HPACK block. Mixing the two desynchronizes the table and corrupts every subsequent header.

Part B — flow-control accounting

Model the two-level window. A sender may transmit DATA only while both the stream window and the connection window are positive; each DATA byte debits both; WINDOW_UPDATE credits one of them.

package h2

import "errors"

type FlowController struct {
	conn    int64            // connection-level window (bytes we may still send)
	streams map[uint32]int64 // per-stream windows
}

const defaultInitialWindow = 65535 // SETTINGS_INITIAL_WINDOW_SIZE default

func NewFlowController() *FlowController {
	return &FlowController{conn: defaultInitialWindow, streams: map[uint32]int64{}}
}

func (fc *FlowController) ensure(stream uint32) {
	if _, ok := fc.streams[stream]; !ok {
		fc.streams[stream] = defaultInitialWindow
	}
}

// CanSend reports whether n DATA bytes are permitted right now.
func (fc *FlowController) CanSend(stream uint32, n int64) bool {
	fc.ensure(stream)
	return fc.conn >= n && fc.streams[stream] >= n
}

// Sent debits both windows after sending n DATA bytes.
func (fc *FlowController) Sent(stream uint32, n int64) error {
	if !fc.CanSend(stream, n) {
		return errors.New("flow control violation: window exhausted")
	}
	fc.conn -= n
	fc.streams[stream] -= n
	return nil
}

// WindowUpdate credits a window (stream==0 means connection-level).
func (fc *FlowController) WindowUpdate(stream uint32, inc int64) {
	if stream == 0 {
		fc.conn += inc
		return
	}
	fc.ensure(stream)
	fc.streams[stream] += inc
}
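The accounting above takes the increment as given. On the wire it arrives as a 4-byte WINDOW_UPDATE payload with its own failure modes (RFC 9113 §6.9), which a sketch like the following could validate before crediting. The function names are hypothetical and the errors are plain strings rather than real h2 error codes.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// maxWindow is the largest legal flow-control window, 2^31-1.
const maxWindow = 1<<31 - 1

// parseWindowUpdate extracts the 31-bit increment from the payload.
func parseWindowUpdate(payload []byte) (int64, error) {
	if len(payload) != 4 {
		return 0, errors.New("FRAME_SIZE_ERROR: WINDOW_UPDATE payload must be 4 bytes")
	}
	inc := binary.BigEndian.Uint32(payload) & 0x7fffffff // mask reserved bit
	if inc == 0 {
		return 0, errors.New("PROTOCOL_ERROR: zero window increment")
	}
	return int64(inc), nil
}

// credit applies an increment and refuses to push a window past 2^31-1.
func credit(window, inc int64) (int64, error) {
	if window+inc > maxWindow {
		return window, errors.New("FLOW_CONTROL_ERROR: window would exceed 2^31-1")
	}
	return window + inc, nil
}

func main() {
	inc, err := parseWindowUpdate([]byte{0x00, 0x00, 0x10, 0x00})
	fmt.Println(inc, err) // 4096 <nil>

	w, err := credit(65535, inc)
	fmt.Println(w, err) // 69631 <nil>

	_, err = credit(65535, maxWindow)
	fmt.Println(err)
}
```

Using int64 for the window, as FlowController does, is what makes the overflow check a simple comparison.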

Tasks

Decode the HEADERS frames from your step-01 capture; print each as :method GET :path /foo. Confirm a repeated header (e.g. a cookie) shows up as a short indexed reference on its second occurrence.
Drive the FlowController with the DATA and WINDOW_UPDATE frames from a real download. Log the connection window over time; watch it drain toward zero and recover on each WINDOW_UPDATE.
Reproduce a stall: stop emitting WINDOW_UPDATE (simulate a slow consumer) and show CanSend returns false while data is pending — the application-level backpressure that looks like a network hang.

Acceptance

Decoded headers match nghttp -v's header dump for the same request.
Your flow-control log shows windows draining and recovering, and you can produce a deliberate stall and explain it as window exhaustion, not packet loss.

Discussion prompts

Why is a too-small SETTINGS_INITIAL_WINDOW_SIZE a throughput bug on a high-latency (high bandwidth-delay-product) link? (Hint: the window caps in-flight bytes ≈ throughput × RTT.)
Why must the HPACK decoder persist for the life of the connection, and what breaks if you create a fresh one per frame?
How is h2 flow control different from, and layered on top of, TCP's own receive-window flow control?

gw-02 step 03 — gRPC, trailers, and the load-balancing trap

Goal

Understand gRPC's wire format (it's just HTTP/2 with conventions), see why grpc-status lives in trailers, and reproduce the classic failure: gRPC behind an L4 load balancer pins all calls to one backend.

Background — gRPC on the wire

HEADERS  :method=POST  :path=/echo.Echo/Unary  content-type=application/grpc
DATA     [0][00 00 00 05][protobuf bytes...]    ← 1 flag byte + 4 len + msg
...                                              (more DATA for streaming)
HEADERS  grpc-status: 0  grpc-message:          ← TRAILERS (END_STREAM here)

The response status you care about is grpc-status (0 = OK), carried in trailers — a second HEADERS frame after the body.
Therefore HTTP 200 + grpc-status 14 is a failed call. A gateway that logs only :status reports it as success. (gw-11 must read trailers.)
Streaming RPCs keep the stream open for many DATA frames — the connection is long-lived and multiplexed.
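The 5-byte prefix in the DATA line above is simple enough to sketch directly. The helpers below are hypothetical names for illustration; they frame and unframe a single message and leave protobuf encoding and compression out of scope.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// frameMessage prepends the gRPC prefix: 1 compressed-flag byte (0 here)
// and a 4-byte big-endian message length.
func frameMessage(msg []byte) []byte {
	out := make([]byte, 5+len(msg))
	binary.BigEndian.PutUint32(out[1:5], uint32(len(msg)))
	copy(out[5:], msg)
	return out
}

// nextMessage splits one length-prefixed message off the front of buf.
// buf may span several DATA frames; an error means "wait for more bytes".
func nextMessage(buf []byte) (msg, rest []byte, compressed bool, err error) {
	if len(buf) < 5 {
		return nil, buf, false, errors.New("incomplete 5-byte prefix")
	}
	n := binary.BigEndian.Uint32(buf[1:5])
	if uint64(len(buf)-5) < uint64(n) {
		return nil, buf, false, errors.New("incomplete message body")
	}
	end := 5 + int(n)
	return buf[5:end], buf[end:], buf[0] == 1, nil
}

func main() {
	wire := frameMessage([]byte("hello"))
	fmt.Printf("% x\n", wire) // 00 00 00 00 05 68 65 6c 6c 6f

	msg, rest, compressed, err := nextMessage(wire)
	fmt.Printf("%q rest=%d compressed=%v err=%v\n", msg, len(rest), compressed, err)
}
```

Message boundaries and DATA frame boundaries are independent: one message can span several frames and one frame can hold several messages, so a gateway that inspects payloads has to buffer across frames.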

Reproduce the L4 LB trap

Spin up 3 identical gRPC backends, put an L4 balancer in front (round-robin on connections), and a client that opens one channel (one h2 connection). Then send many calls.

# 3 backends on :9001..:9003 (use grpc's helloworld or any echo server)
for p in 9001 9002 9003; do grpc_echo_server --port $p & done

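# L4 round-robin in front (your gw-01 proxy, or nginx stream, or envoy
# tcp_proxy) on :9000 -> {9001,9002,9003}. Configure all three backends in
# whichever proxy you use; with a single origin the pin proves nothing.
/tmp/l4proxy -listen :9000 -origin 127.0.0.1:9001   # connection-level pin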

# One client channel, 1000 unary calls:
ghz --insecure --proto echo.proto --call echo.Echo/Unary \
    -c 1 -n 1000 127.0.0.1:9000

Observe: nearly all 1000 calls land on a single backend, because the one h2 connection was balanced once, at connect time, and all 1000 multiplexed streams ride it. Now compare:

L7 balancer (Envoy with http_connection_manager + a gRPC route, or any per-request proxy): the same 1000 calls spread ~evenly across the 3 backends, because each request is balanced.
Client-side LB (gRPC round_robin policy with multiple resolved addresses): also spreads, by opening one subchannel per backend.

Tasks

Stand up 3 backends and reproduce the skew through an L4 LB; capture per-backend request counts.
Swap in an L7 path (Envoy gRPC route is the realistic one — see gw-08) and show the distribution evens out.
Make a backend return grpc-status: 14 while keeping HTTP 200. Confirm an L4/HTTP-only view calls it healthy, and that reading the trailer reveals the failure. Note where this matters for retries (gw-06) and SLOs (gw-11).

Acceptance

A reproduced, quantified skew (e.g. "987/1000 on backend A") through the L4 LB, and an even split through the L7/client-side path.
A demonstration that grpc-status in trailers can disagree with the HTTP status, and a one-paragraph explanation of the consequences for monitoring and retries.

Discussion prompts

Three fixes for "gRPC behind a connection-level LB": (a) L7/per-request proxy, (b) client-side LB with one subchannel per backend, (c) force periodic reconnects with MAX_CONNECTION_AGE so connections re-balance. When would you pick each?
Why does MAX_CONNECTION_AGE partially fix it but also reintroduce exactly the connection churn that gw-04 is trying to eliminate? (The tension between rebalancing and connection reuse is a real design trade-off — be ready to discuss it.)
A streaming RPC pins to one backend for its whole life. How do you drain a backend that holds long-lived gRPC streams? (Tie to GOAWAY and gw-01/gw-05 draining.)

gw-03 — Building an API Gateway: The Zuul Model

This is the lab the role is named after. An API gateway is the single front door for a fleet of services: it terminates the client protocol (gw-02), authenticates, routes each request to the right backend, applies policy (rate limits, header rewrites), proxies the request with resilience (gw-06), and emits observability (gw-11). At Netflix this is Zuul — 80+ clusters, fronting ~100 backend service clusters, more than 1M requests per second. The "Evolution of Edge @ Netflix" talk is the story of how that gateway went from a blocking, thread-per-request design (Zuul 1) to an asynchronous, non-blocking, Netty-based design (Zuul 2).

You will build a gateway in the Zuul 2 shape: a non-blocking acceptor (from gw-01) wrapping a filter chain with three phases — inbound, endpoint, outbound — and a routing table. The filter-chain model is the single most important architectural idea in this phase, because it's how you "design, develop, and integrate functionality within the data plane" — the exact words of the JD.

1. What is it?

An API gateway is a reverse proxy with a programmable request lifecycle. Every cross-cutting concern that you'd otherwise duplicate in every service — auth, routing, rate limiting, retries, logging, header hygiene, canary steering — is implemented once, as a filter, and runs on every request at the edge.

Zuul's model (which Envoy, Spring Cloud Gateway, and the Kubernetes Gateway API all echo) decomposes the lifecycle into ordered phases:

            ┌───────────────────────── ZUUL ──────────────────────────┐
 client ──▶ │ INBOUND filters ─▶ ENDPOINT (route) ─▶ OUTBOUND filters │ ──▶ client
            │   authn/authz        pick origin         response       │
            │   routing decision   proxy the call      rewrite        │
            │   rate limit         (or serve locally)  logging        │
            └────────────────────────────┬────────────────────────────┘
                                         │ proxy
                                         ▼
                              origin service cluster

Inbound filters run on the request before routing: decode, authenticate, decorate, decide where it goes, reject early (rate limit / WAF).
Endpoint filters are the terminal action: usually "proxy to the chosen origin," but can be "serve a static/health response" or "return an error." This is where the origin call (gw-04, gw-06) happens.
Outbound filters run on the response: rewrite headers, strip hop-by-hop headers, record metrics, emit the access log.

The other half is the threading model. Zuul 2 runs on Netty: a small pool of event-loop threads, each owning many connections, each never blocking. This is the gw-01 acceptor generalized — and the source of the "you must never block the event loop" discipline.

2. Why does it matter?

It's the literal job. "Designing, developing, and integrating functionality within our data plane and control plane to uplift the security, observability, and resilience posture" is writing filters and the machinery that runs them. Everything else in this phase plugs into this model.
The blocking→async transition is the team's defining story. Zuul 1 used one thread per request; under load, threads pile up waiting on slow origins and the gateway falls over (thread-pool exhaustion). Zuul 2 uses an event loop so a slow origin parks a cheap continuation, not an expensive thread. Netflix reported roughly a 25% throughput gain with a corresponding ~25% reduction in CPU. If you understand why (gw-01's C10K), you can speak to it credibly.
The filter chain is how you reason about ordering and blast radius. "Where does auth run relative to rate limiting? Should the WAF run before or after routing? If an inbound filter throws, do outbound filters still run?" These are daily design-review questions, and the filter model gives you a precise vocabulary for them.
It's the integration point for every other lab. Security (gw-07) is an inbound filter; the connection pool (gw-04) and resilience (gw-06) live in the endpoint filter; observability (gw-11) is an outbound filter plus instrumentation throughout.

3. How does it work?

The request context

A single mutable request context flows through all phases, carrying the request, the (eventual) response, the chosen route, attributes set by earlier filters, and timing. In Zuul this is SessionContext.

RequestContext {
    request    : the inbound HTTP request (method, path, headers, body stream)
    route      : chosen origin + metadata (set by a routing filter)
    response   : the response being built (set by endpoint/outbound)
    attributes : map for cross-filter data (e.g. authenticated identity)
    timings    : per-phase latency for the access log
    shouldStop : early-exit flag (a filter can short-circuit the chain)
}

Filter shape

Every filter declares its type (inbound/endpoint/outbound), an order (filters run sorted within a phase), a shouldFilter predicate (does it apply to this request?), and an apply body. This mirrors the shape of Zuul's ZuulFilter interface, which names the same pieces filterType(), filterOrder(), shouldFilter(), and apply()/applyAsync():

Filter {
    type()        -> Inbound | Endpoint | Outbound
    order()       -> int                  # lower runs first
    shouldFilter(ctx) -> bool             # cheap applicability check
    apply(ctx)                            # the work
}

The async discipline (the heart of Zuul 2)

On an event loop you must never block. A filter that calls an origin returns a future/promise, and the framework resumes the chain when it completes — the event-loop thread is freed to service other connections meanwhile.

inbound chain (sync, cheap) ─▶ endpoint filter returns Future<Response>
        event loop is now FREE to serve other connections
        ... origin responds ...
        event loop resumes ─▶ outbound chain ─▶ write response

In Go, goroutines + the netpoller give you this for free: a goroutine blocked on network I/O is parked by the runtime and does not hold an OS thread. In Java/Netty you do it explicitly with CompletableFuture/Observable and you must be disciplined: one synchronous DB call or blocking lock on the event-loop thread stalls every connection that thread owns. This is the #1 operational hazard of the model and a guaranteed interview topic.

Routing

The endpoint phase needs a route: map the request to an origin cluster. A routing table maps a matcher (host + path prefix + method + header predicates) to a cluster name; the cluster resolves to endpoints via service discovery (gw-08) and gets balanced + pooled (gw-04, gw-06).

route table (most specific match wins: longest prefix, then header/method predicates):
  Host=api.x.com  Path=/v1/play/*    -> cluster "playback"
  Host=api.x.com  Path=/v1/search/*  -> cluster "search"
  Header x-canary=true  Path=/v1/play/*  -> cluster "playback-canary"   (gw-12)
  default                            -> cluster "edge-fallback"

Routing is dynamic: the control plane pushes table updates to the data plane without a redeploy (gw-08). That push/no-redeploy property is what makes a gateway operable at fleet scale.

Where each concern lives

Concern	Phase	Lab
TLS / mTLS termination	before inbound (connection setup)	gw-07
Decode / normalize protocol	inbound	gw-02
Authn / authz	inbound	gw-07
Rate limit / load shed	inbound	gw-06
Routing decision	inbound	this lab
Pick origin + pooled connection	endpoint	gw-04
Retry / circuit break / hedge	endpoint	gw-06
Proxy with backpressure	endpoint	gw-01
Response header rewrite	outbound	this lab
Access log / metrics / trace	outbound (+ throughout)	gw-11

4. Core terminology

Term	Definition
API gateway	A reverse proxy with a programmable per-request lifecycle and dynamic routing.
Filter	A pluggable unit of request/response logic, typed by phase and ordered.
Inbound / Endpoint / Outbound	The three Zuul filter phases: pre-routing, terminal action, post-response.
Request context	The mutable per-request state carried through the chain (Zuul `SessionContext`).
Event loop	A thread that multiplexes many connections via `epoll`/`kqueue` and must never block.
Thread-per-request	The Zuul 1 / blocking model; thread-pool exhaustion under slow origins.
Route / cluster	A matcher → named backend mapping; cluster resolves to endpoints via discovery.
Short-circuit	A filter ending the chain early (auth reject, cache hit, rate-limit 429).
Hop-by-hop headers	Headers a proxy must consume, not forward (`Connection`, `Keep-Alive`, `TE`, `Upgrade`...).
Dynamic config	Routing/policy pushed from the control plane without redeploy (gw-08).

5. Mental models

A gateway is middleware with a deploy boundary. A filter chain is the same idea as web-framework middleware (func(next) handler), but it runs as its own fleet in front of every service, so a change is one deploy instead of N. That centralization is the value and the risk — a bad filter breaks everything at once (gw-12 is about shipping changes safely).
The event loop is a kitchen pass, not a line of cooks. One expediter (event-loop thread) watches every order (connection) and hands off work the instant it's ready. If the expediter stops to chop an onion (a blocking call), the whole pass halts. The async rule ("never block the loop") is "the expediter never picks up a knife."
Filters are a pipeline with a kill switch. Each stage can pass the request along, modify it, or stop the line and return a response. Auth that fails stops the line before routing — you never want to open an origin connection for a request you're going to 401.
Routing is a routing table, not an if/else. Treat it as data pushed from a control plane, matched by precedence, hot-reloadable. The moment routing lives in code you redeploy to change a route, and at fleet scale that's the difference between minutes and a day.

6. Common misconceptions

"Async is just for performance." It changes correctness under overload. A thread-per-request gateway with slow origins exhausts its thread pool and stops serving fast origins too (head-of-line at the pool). The event-loop model degrades gracefully because a slow origin costs a cheap parked continuation, not a thread.
"A gateway should do everything." Centralizing too much logic makes the gateway a monolith with fleet-wide blast radius and a bottleneck team. The discipline is: cross-cutting infrastructure concerns at the gateway; business logic in services. (Netflix's history with this is instructive — they pushed dynamic Groovy filters, then pulled back toward typed, reviewed filters.)
"Filters run in the order I wrote them." They run in declared order within their phase. Auth (inbound, order 10) always precedes routing (inbound, order 50) regardless of source layout. Getting ordering explicit is the whole point.
"Blocking once is fine." On an event loop, one blocking call on the loop thread stalls every connection that thread owns — possibly thousands. There is no "just this once." Offload blocking work to a separate executor or make it async.
"The gateway is stateless." Mostly — but it holds connection pools, rate-limiter token buckets, circuit-breaker state, and HPACK tables. That state is why draining and config push need care.

7. Interview talking points

"Design an API gateway." Lead with the data/control-plane split, then the filter-chain lifecycle (inbound→endpoint→outbound), then the threading model (event loop, never block), then where resilience and observability hook in. Name Zuul's three phases explicitly. This single answer demonstrates the whole phase.
"Why did Zuul go async/Netty, and what did it cost?" Thread-per-request exhausts under slow origins (head-of-line at the thread pool); an event loop parks cheap continuations instead. Gain: ~25% more throughput at ~25% less CPU. Cost: every filter must be non-blocking; debugging async stacks is harder; one blocking call poisons a loop. The honest cost/benefit is the senior answer.
"Where does auth run, and why before routing?" Inbound, early order, before the routing/endpoint phase, so you never open an origin connection or do work for a request you'll reject. Short-circuit on failure. This shows you think about blast radius and resource use.
"How do you change a route without a redeploy?" Routing is data pushed from the control plane (gw-08); the data plane hot-reloads it atomically. Discuss propagation safety (validate before apply, version configs, roll out incrementally); that's the consensus intuition from db-17 applied to config.
"A filter needs to call an external service (e.g., authz). How, without wrecking latency?" Make it async with a tight timeout and a fallback (fail-open vs fail-closed is a security decision; see gw-07); cache decisions; never block the event loop; budget the added latency against the request SLO.
"How do you safely roll out a risky new filter to the whole fleet?" Flag-gate it, shadow it (run it, don't act on the result, compare), canary a small % of traffic, watch RED metrics, auto-rollback on SLO breach (gw-11, gw-12). Acknowledge the fleet-wide blast radius explicitly.

8. Connections to other labs

gw-01 (L4) is the acceptor + event loop this gateway wraps; the threading discipline starts there.
gw-02 (L7) is the protocol decoding the inbound phase depends on.
gw-04 (connection management) lives in the endpoint phase — the pooled, subsetted origin call.
gw-06 (resilience) is endpoint-phase logic: retries, circuit breaking, adaptive concurrency, load shedding.
gw-07 (security) is the inbound authn/authz filter + TLS termination.
gw-08 (Envoy/xDS) is the control plane that pushes this gateway's routes and clusters; Envoy's HCM + filter chain is the production twin of this lab.
gw-11 (observability) is the outbound access-log/metrics/trace filter plus instrumentation throughout.
gw-12 (migration) is how you ship changes to this fleet safely.

gw-03 — The Hitchhiker's Guide to the API Gateway

Companion to CONCEPTS.md, with the runnable gateway in src/go/gateway/. This is the lab the role is named after; the filter-chain model here is the mental model you'll use every day on a Cloud Gateway team.

An API gateway is a reverse proxy with a programmable request lifecycle. Zuul decomposes that lifecycle into three ordered phases — inbound → endpoint → outbound — and every cross-cutting concern (auth, routing, rate limiting, retries, logging) is a filter in one of them. Build that engine once and you understand Zuul, Spring Cloud Gateway, Envoy's HTTP filters, and the Gateway API all at once.

1. The engine (gateway.go)

Gateway.ServeHTTP is the whole lifecycle in 25 lines:

build RequestContext
run INBOUND filters (sorted by Order) — STOP early if one short-circuits
run the ENDPOINT filter (only if not stopped and it applies)
run OUTBOUND filters (ALWAYS — even after a short-circuit)
write the response

Four design decisions carry the weight:

One mutable RequestContext threaded through the chain. This is Zuul's SessionContext. It carries the request, the response being built, the chosen route, an attributes map for cross-filter data (identity, cluster, status_class), and per-filter timings. The cost is shared mutable state — so filters must document what they read/write — but it matches the real codebase exactly.

Filters are interfaces with an explicit Order() and Type(), not closures. TestPhaseOrderingAndShortCircuit proves the ordering: with no Authorization header, the inbound AuthFilter (order 10) short-circuits with a 401 before RoutingFilter (order 50) or the endpoint ever runs, so you never open an origin connection for a request you're going to reject. Explicit ordering is the point; closures would hide it.

Outbound always runs. Notice the loop has no if !stopped guard around the outbound phase. TestPhaseOrderingAndShortCircuit asserts the access log fires even on the 401. Metrics, logs, and trace completion must never be skipped just because an earlier filter rejected the request — otherwise your dashboards undercount errors precisely when errors spike.

Panic isolation. runOne wraps each Apply in recover(): a filter that panics becomes a 502, the chain continues, and the access log still fires (TestPanicBecomes502). A gateway that can crash a connection on one filter's bug is not production-grade — and at fleet-wide blast radius, one bad filter would otherwise take down everything.
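The recover-per-filter idea can be sketched as follows. This is an illustration under stated assumptions rather than the lab's `runOne`: the `ctx` type is a hypothetical stand-in for `RequestContext`, and the sketch only shows the panic being turned into a 502. Whether the remaining filters of the phase still run is left to the engine.

```go
package main

import (
	"fmt"
	"net/http"
)

// ctx is a hypothetical stand-in for the lab's RequestContext; only the
// field this sketch needs is modeled.
type ctx struct {
	status int
}

// runOne calls apply and converts a panic into a 502 on the context, so the
// caller can carry on to the outbound phase instead of losing the connection.
func runOne(c *ctx, name string, apply func(*ctx)) {
	defer func() {
		if r := recover(); r != nil {
			c.status = http.StatusBadGateway
			fmt.Printf("filter %s panicked: %v\n", name, r)
		}
	}()
	apply(c)
}

func main() {
	c := &ctx{}
	runOne(c, "buggy", func(*ctx) { panic("write to nil map") })
	fmt.Println("status after panic:", c.status) // 502
}
```

Because the `recover` lives in a deferred function inside `runOne`, each filter gets its own isolation boundary: one filter's panic does not unwind the whole `ServeHTTP` call.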

2. Routing as data (routing.go)

Routing is a table you can replace atomically, not an if/else you redeploy to change. Router holds an atomic.Pointer[RouteTable]; Swap replaces the whole table with a single atomic pointer store.

TestHotReloadNoDrops is the important one: four reader goroutines continuously match requests while a fifth swaps the table in a tight loop, and the test asserts zero failed lookups across tens of thousands of matches — under -race. That lock-free swap is exactly how a data-plane node applies a control-plane config push (gw-08) without dropping a request mid-reconfiguration. The read path (1M+ rps) never takes a lock; only the rare write allocates a new table and swaps the pointer.
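A minimal sketch of the lock-free swap described above. `Router`, `RouteTable`, and `Swap` follow the names used in this guide, but the table's internal shape (`prefixes`) and the body of `Match` are assumptions made for illustration, not the lab's implementation.

```go
package main

import (
	"fmt"
	"strings"
	"sync/atomic"
)

// RouteTable is treated as immutable once published; a reload builds a new one.
type RouteTable struct {
	prefixes map[string]string // path prefix -> cluster (hypothetical shape)
}

// Router publishes the current table through an atomic pointer.
type Router struct {
	table atomic.Pointer[RouteTable]
}

// Swap installs a new table. Readers see either the old table or the new
// one, never a half-updated mix.
func (r *Router) Swap(t *RouteTable) { r.table.Store(t) }

// Match takes one snapshot of the table and resolves against it without
// taking a lock. Longest matching prefix wins.
func (r *Router) Match(path string) (string, bool) {
	t := r.table.Load()
	if t == nil {
		return "", false
	}
	best, cluster := -1, ""
	for p, c := range t.prefixes {
		if strings.HasPrefix(path, p) && len(p) > best {
			best, cluster = len(p), c
		}
	}
	return cluster, best >= 0
}

func main() {
	var r Router
	r.Swap(&RouteTable{prefixes: map[string]string{"/v1": "playback"}})
	fmt.Println(r.Match("/v1/play")) // playback true
	r.Swap(&RouteTable{prefixes: map[string]string{"/v1": "playback-v2"}})
	fmt.Println(r.Match("/v1/play")) // playback-v2 true
}
```

The safety argument rests on never mutating a table after it is published: a writer builds a fresh `RouteTable` and stores the pointer, so a reader that loaded the old pointer keeps a consistent view for the rest of its lookup.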

Specificity, not just longest-prefix

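The subtle part is precedence. Match ranks routes in the spirit of the Gateway API's precedence rules: longest prefix wins, then among equal prefixes the more-constrained route wins (this lab ranks a header predicate above a method match; the Gateway API spec ranks method first). Envoy, by contrast, evaluates a virtual host's routes in configured order and takes the first match, so there the operator encodes specificity by ordering. TestLongestPrefixAndHeaderPredicate has two routes with the same prefix /v1/play: a base route and a canary route gated on x-canary: true. Naive "first match of the longest prefix" with the base route listed first would always pick the base route and the canary would be dead config. The specificity score (len(prefix)*10 + headerBonus*2 + methodBonus) makes the canary win for canary requests, which is precisely how header-based canary steering (gw-12) is implemented. As long as each bonus is 0 or 1, the bonuses add at most 3, so the multiplier of 10 keeps prefix length dominant. Getting this ordering right is a real maintainer detail; getting it wrong silently breaks canaries.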
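The scoring formula can be tried out in isolation. The sketch below assumes each bonus is 0 or 1 and uses a hypothetical `Route` shape and `pick` function; it also does plain string-prefix matching with no path-segment boundary check, which a real router would need.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// Route is a hypothetical route shape for this sketch.
type Route struct {
	Prefix      string
	Method      string // "" matches any method
	HeaderName  string // "" means no header predicate
	HeaderValue string
	Cluster     string
}

func (rt Route) matches(method, path string, h http.Header) bool {
	if !strings.HasPrefix(path, rt.Prefix) {
		return false
	}
	if rt.Method != "" && rt.Method != method {
		return false
	}
	return rt.HeaderName == "" || h.Get(rt.HeaderName) == rt.HeaderValue
}

// score is len(prefix)*10 + headerBonus*2 + methodBonus, each bonus 0 or 1.
func (rt Route) score() int {
	s := len(rt.Prefix) * 10
	if rt.HeaderName != "" {
		s += 2
	}
	if rt.Method != "" {
		s++
	}
	return s
}

// pick returns the cluster of the highest-scoring route that matches.
func pick(routes []Route, method, path string, h http.Header) (string, bool) {
	best, cluster := -1, ""
	for _, rt := range routes {
		if s := rt.score(); rt.matches(method, path, h) && s > best {
			best, cluster = s, rt.Cluster
		}
	}
	return cluster, best >= 0
}

func main() {
	routes := []Route{
		{Prefix: "/v1/play", Cluster: "playback"},
		{Prefix: "/v1/play", HeaderName: "x-canary", HeaderValue: "true", Cluster: "playback-canary"},
	}
	canary := http.Header{}
	canary.Set("x-canary", "true")

	fmt.Println(pick(routes, "GET", "/v1/play/42", http.Header{})) // playback true
	fmt.Println(pick(routes, "GET", "/v1/play/42", canary))        // playback-canary true
}
```

Both routes score 80 on prefix length; the header predicate adds 2, so the canary route wins only for requests that actually carry the header, regardless of the order the routes were declared in.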

3. The endpoint: where the origin call lives (filters.go)

ProxyEndpoint.Apply is the terminal action. Read it for four things you must get right in any proxy:

A per-request timeout is the first resilience primitive. context.WithTimeout bounds the origin call; TestProxyTimeoutIs504 proves a slow origin yields 504, not a hung request. Everything in gw-06 (retries, hedging, circuit breaking) is built on top of bounded latency; without the timeout, they're built on sand.
Error classification. statusForError maps a deadline to 504 and anything else (connection refused, reset) to 502 (TestProxyDeadOriginIs502). The difference matters to clients and to your SLOs (gw-11).
Hop-by-hop hygiene. sanitizeForOrigin strips Connection, Keep-Alive, Transfer-Encoding, Upgrade, etc. (RFC 9110 §7.6.1) and appends X-Forwarded-For. TestProxyForwardsToOrigin asserts the origin never sees the client's Connection header and does see an XFF. Forwarding hop-by-hop headers is a classic correctness/security bug.
The Transport seam. ProxyEndpoint.Transport is an http.RoundTripper. Swap in gw-04's pooled+subsetted transport and gw-06's resilient transport here — without touching the gateway engine. That seam is the whole point: cross-cutting concerns compose.
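The first two points (bounded latency, then error classification) fit in one small sketch. It is written under assumptions: `statusForError` reuses the name from this guide but the body is a guess at the mapping described, while `callOrigin` and `errRefused` are hypothetical stand-ins that simulate an origin instead of dialing one.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
	"time"
)

// statusForError maps a non-nil origin-call error to a gateway status:
// deadline or network timeout -> 504, anything else -> 502.
func statusForError(err error) int {
	var ne net.Error
	if errors.Is(err, context.DeadlineExceeded) || (errors.As(err, &ne) && ne.Timeout()) {
		return http.StatusGatewayTimeout
	}
	return http.StatusBadGateway
}

// callOrigin simulates an origin that takes `latency` to answer, bounded by
// a per-request timeout (hypothetical helper; no network involved).
func callOrigin(latency, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	select {
	case <-time.After(latency):
		return nil
	case <-ctx.Done():
		return ctx.Err() // context.DeadlineExceeded
	}
}

// errRefused stands in for a dial failure such as "connection refused".
var errRefused = errors.New("dial tcp: connect: connection refused")

func main() {
	slow := callOrigin(200*time.Millisecond, 20*time.Millisecond)
	fmt.Println("slow origin:", statusForError(slow))       // 504
	fmt.Println("dead origin:", statusForError(errRefused)) // 502
}
```

In a real endpoint the context would be attached to the outgoing request (for example with `req.WithContext`) so the transport itself abandons the call at the deadline.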

4. The async discipline (the Zuul 2 lesson)

The CONCEPTS file explains why Zuul moved from thread-per-request to Netty's event loop: under slow origins, a thread-per-request pool exhausts and stops serving fast origins too (head-of-line at the pool); an event loop parks a cheap continuation instead. The reported win was ~25% more throughput at ~25% less CPU.

In Go you get this for free: a goroutine blocked in RoundTrip parks on the netpoller (gw-01), not an OS thread, so "blocking" code scales. The cost you must respect, and the thing that bites Java/Netty teams, is that on a real event loop you must never block the loop thread: no synchronous DB call and no blocking lock in a filter. The discipline is "the expediter never picks up a knife." When you read Zuul's HttpInboundSyncFilter vs the async filter types, that distinction (CPU-only inline vs I/O off-loop) is exactly this.

5. Hands-on

cd src/go
bash ../../scripts/verify.sh   # vet + build + test -race (scripts/ sits at the lab root)

go build -o /tmp/gateway ./cmd/gateway
python3 -m http.server 9000 &              # an origin
/tmp/gateway -listen :8080 -route /v1=http://127.0.0.1:9000 &

curl -i localhost:8080/healthz                       # 200 (auth-exempt, local endpoint)
curl -i localhost:8080/v1/thing                      # 401 (no token)
curl -i -H 'Authorization: Bearer t' localhost:8080/v1/  # 200 via origin
curl -i -H 'Authorization: Bearer t' localhost:8080/nope  # 404 (no route)

Watch the access log: every request is logged with status and route — including the 401 and 404 — because outbound always runs.

6. Where the other labs plug in

Phase	Filter	Lab
inbound	TLS terminate + mTLS identity	gw-07
inbound	rate limit / load shed	gw-06
inbound	routing	this lab
endpoint	pooled+subsetted `Transport`	gw-04
endpoint	retries / circuit breaking / hedging	gw-06
endpoint	proxy with backpressure	gw-01
outbound	access log / metrics / trace	gw-11
(config)	routes pushed by a control plane	gw-08 / gw-10

The gateway engine you built is the frame every other Phase-6 lab hangs on.

7. Exercises

Add a rate-limit filter (inbound, order 20, before routing) using a token bucket; short-circuit with 429 when empty. Then make it per-client (keyed by identity). Why is a per-instance bucket actually limit × instances fleet-wide, and when does that matter (gw-06)?
Stream the response body instead of buffering it in ProxyEndpoint. What do you gain (latency, bounded memory) and lose (the ability to inspect/transform the body, easy content-length)?
Add a header-rewrite outbound filter that strips a sensitive upstream header. Confirm via the access log that it runs after the proxy.
Implement a "shadow" filter (gw-12 preview): in the endpoint, additionally fire the request at a second cluster, discard its response, and record a diff — without affecting the user's response.
Make routing dynamic from a file: watch a JSON routes file and Swap the table on change. Prove with the hot-reload test pattern that live traffic sees no drops. You've now built the data-plane half of gw-08.

gw-03 — References

The Zuul lineage (read these closely)

Netflix/zuul (GitHub) — the open-source gateway. Read the filter interfaces (ZuulFilter, HttpInboundFilter, HttpOutboundFilter, HttpInboundSyncFilter), SessionContext, and the Netty ChannelHandler pipeline. https://github.com/Netflix/zuul
Zuul 2: The Netflix Journey to Asynchronous, Non-Blocking Systems — the blocking→event-loop rationale; ~25% throughput/CPU win. https://netflixtechblog.com/zuul-2-the-netflix-journey-to-asynchronous-non-blocking-systems-45947377fb5c
Open Sourcing Zuul 2 — architecture overview, the filter phases, the 80+ clusters / 1M+ rps scale. https://netflixtechblog.com/open-sourcing-zuul-2-82ea476cb2b3
Zuul wiki — filter types, push messaging, request lifecycle. https://github.com/Netflix/zuul/wiki

Comparable gateways to contrast

Envoy http_connection_manager + HTTP filters — the C++ production twin of this lab; read one filter end to end (gw-08). https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/http/http_filters
Spring Cloud Gateway — the JVM successor many teams use; same filter model, reactive (Project Reactor) instead of raw Netty.
Kubernetes Gateway API — the standardized routing CRDs (gw-10).
NGINX/OpenResty — the Lua-scriptable predecessor model.

Async / event-loop foundations

Netty in Action — EventLoopGroup, ChannelPipeline, ChannelHandler, the "never block the event loop" rule.
ReactiveX / Project Reactor docs — the Observable/Mono/Flux model Zuul 2 and Spring Cloud Gateway use for async chains.
Go net/http ReverseProxy (net/http/httputil) — the stdlib reverse proxy this lab's Go version builds on; read ReverseProxy and Transport.

Cross-lab dependencies

Upstream: gw-01 (acceptor/event loop), gw-02 (L7 decoding).
Downstream: gw-04 (endpoint-phase connection pool), gw-06 (endpoint-phase resilience), gw-07 (inbound security), gw-08 (control plane pushing routes), gw-11 (outbound observability), gw-12 (shipping filter changes safely).

gw-03 — Analysis

A design-review treatment of the filter-chain gateway you build in steps/.

Required behaviors

Phase ordering is enforced, not incidental. Inbound filters all run (in order) before the endpoint filter; outbound filters all run after. The framework, not the filter author, guarantees this.
Short-circuit is first-class. Any inbound filter can set the response and stop the chain (401, 429, cache hit). Once stopped, no further inbound or endpoint filters run, but outbound filters still run (so the access log and metrics always fire).
Errors are handled in-band. A filter that throws is converted to a 5xx by an error filter; the chain still produces a response and an access log. A gateway that can crash a connection on a filter bug is not production-grade.
Never block the loop. Endpoint origin calls are async with a timeout; in Go this is a goroutine + context deadline, in Java/Netty a CompletableFuture with a timeout.
Hop-by-hop hygiene. Strip Connection, Keep-Alive, Proxy-*, TE, Transfer-Encoding, Upgrade before forwarding (RFC 9110 §7.6.1); add/normalize X-Forwarded-For/Forwarded.
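A sketch of the hop-by-hop rule in Go. `stripHopByHop` is a hypothetical name (the lab's own function may differ), and "Proxy-*" is expanded here into three concrete header names as an assumption. The sketch also removes any header named inside the `Connection` value, which RFC 9110 §7.6.1 requires of an intermediary.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// hopByHop lists headers that apply to a single connection and must not be
// forwarded to the origin.
var hopByHop = []string{
	"Connection", "Keep-Alive", "Proxy-Connection", "Proxy-Authenticate",
	"Proxy-Authorization", "TE", "Transfer-Encoding", "Upgrade",
}

// stripHopByHop returns a copy of in that is safe to send to the origin and
// appends clientIP to X-Forwarded-For.
func stripHopByHop(in http.Header, clientIP string) http.Header {
	out := in.Clone()
	if out == nil {
		out = http.Header{}
	}
	// Any header named in Connection is hop-by-hop for this hop as well.
	for _, v := range in.Values("Connection") {
		for _, name := range strings.Split(v, ",") {
			if name = strings.TrimSpace(name); name != "" {
				out.Del(name)
			}
		}
	}
	for _, h := range hopByHop {
		out.Del(h)
	}
	// Append rather than replace, so an existing proxy chain is preserved.
	chain := append(out.Values("X-Forwarded-For"), clientIP)
	out.Set("X-Forwarded-For", strings.Join(chain, ", "))
	return out
}

func main() {
	in := http.Header{}
	in.Set("Connection", "keep-alive, X-Debug-Hop")
	in.Set("X-Debug-Hop", "1")
	in.Set("Accept", "application/json")

	out := stripHopByHop(in, "203.0.113.7")
	fmt.Println(out.Get("X-Debug-Hop") == "", out.Get("Accept"), out.Get("X-Forwarded-For"))
}
```

Working on a clone keeps the inbound request untouched, which matters if an outbound filter later logs the headers the client actually sent.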

Design decisions

One mutable RequestContext threaded through the chain. Mirrors Zuul's SessionContext. Simpler than passing many args, and matches the real codebase so the lab transfers. The cost is shared mutable state — filters must document what attributes they read/write.
Filters are interfaces, not closures. Type(), Order(), ShouldFilter(), Apply(). The explicit Order() and Type() make ordering reviewable and testable; closures would hide it.
Endpoint filter returns the response (async). The endpoint phase is the only one that produces a response from an origin; inbound filters only decide and decorate. This keeps "where the origin call happens" in exactly one place — where pooling (gw-04) and resilience (gw-06) plug in.
Routing is data. The route table is a struct you can replace atomically (a pointer swap; the Go code uses atomic.Pointer[RouteTable], and an atomic.Value or an RWMutex would also work), which previews the control-plane hot-reload of gw-08. No redeploy to change a route.

Tradeoffs worth flagging

Centralization = leverage + blast radius. Every filter runs on every request; a 1ms regression in an inbound filter is 1ms on 1M+ rps. The review bar for gateway code is higher than for a single service. This is why the role lists "high-quality code reviews" and "detailed design reviews" as core duties.
Sync vs async filters. Zuul distinguishes HttpInboundSyncFilter (cheap, CPU-only, runs inline) from async filters (may do I/O). Make the cheap path cheap; only pay the async machinery when a filter actually needs I/O. Misclassifying a blocking filter as sync is the classic event-loop-stall bug.
Dynamic vs typed filters. Netflix historically shipped Groovy filters loadable at runtime (fast iteration, no deploy) and learned the operational cost (a bad dynamic filter, fleet-wide, instantly). Modern practice leans toward typed, reviewed, deployed filters with config-driven behavior. Be able to argue both sides.
State on a "stateless" gateway. Rate-limiter buckets, circuit state, and pools are per-instance state. That means a token-bucket limit is really per-instance × instances unless you use a shared store — a subtle correctness point for distributed rate limiting (gw-06).
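The per-instance arithmetic in the last point is easy to check with a toy fleet. This is a sketch with hypothetical names (`bucket`, `admitted`): the clock is injected and frozen so that the burst arrives at a single instant, and requests are assumed to be spread round-robin across instances.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// bucket is a minimal token bucket with an injected clock.
type bucket struct {
	capacity   float64
	ratePerSec float64
	tokens     float64
	last       time.Time
}

func newBucket(capacity, ratePerSec float64, now time.Time) *bucket {
	return &bucket{capacity: capacity, ratePerSec: ratePerSec, tokens: capacity, last: now}
}

// allow refills for the elapsed time, then spends one token if available.
func (b *bucket) allow(now time.Time) bool {
	b.tokens = math.Min(b.capacity, b.tokens+now.Sub(b.last).Seconds()*b.ratePerSec)
	b.last = now
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// admitted sends a burst of requests, spread round-robin across gateway
// instances that each enforce the same local limit, and counts how many the
// fleet as a whole lets through.
func admitted(instances, perInstanceLimit, requests int) int {
	now := time.Unix(0, 0)
	fleet := make([]*bucket, instances)
	for i := range fleet {
		fleet[i] = newBucket(float64(perInstanceLimit), float64(perInstanceLimit), now)
	}
	n := 0
	for r := 0; r < requests; r++ {
		if fleet[r%instances].allow(now) {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println("1 instance: ", admitted(1, 100, 1000)) // 100
	fmt.Println("4 instances:", admitted(4, 100, 1000)) // 400
}
```

A configured limit of 100 therefore admits 400 requests from the same burst once the fleet has four instances, and the effective limit changes every time the fleet autoscales.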

What production adds beyond this lab

Request/response body buffering vs streaming decisions (buffer to inspect/transform, stream to preserve latency and bound memory).
A WAF / security filter stage and bot management (gw-07).
Per-route timeouts, retries, and circuit-breaker config from the control plane (gw-06, gw-08).
Backpressure into the filter chain so an overloaded gateway sheds load at admission rather than collapsing (gw-06).
Full observability: per-filter latency, per-route RED metrics, trace spans that cross the proxy (gw-11).
Safe rollout machinery for filters: flags, shadow, canary, auto-rollback (gw-12).

gw-03 — Execution

Prerequisites

Go ≥ 1.25 (stdlib only, offline). Optional: curl, python3.

One-shot

cd gw-03-api-gateway && bash scripts/verify.sh   # → "=== gw-03 OK ==="

Per-language workflow (Go)

cd gw-03-api-gateway/src/go
go test -race -count=1 ./...      # 8 tests in package gateway
go build -o /tmp/gateway ./cmd/gateway

Run it

python3 -m http.server 9000 &
/tmp/gateway -listen :8080 -route /v1=http://127.0.0.1:9000

curl -i localhost:8080/healthz                                  # 200 (exempt)
curl -i localhost:8080/v1/x                                     # 401 (no token)
curl -i -H 'Authorization: Bearer t' localhost:8080/v1/        # 200 via origin
curl -i -H 'Authorization: Bearer t' localhost:8080/nope       # 404 (no route)

-route prefix=originURL is repeatable. Ctrl-C drains in-flight via http.Server.Shutdown.

Package map

File	What
`gateway/gateway.go`	the filter-chain engine (phases, context, panic isolation)
`gateway/routing.go`	route table + atomic hot-reload + specificity matching
`gateway/filters.go`	auth, local + proxy endpoints, access log, error normalizer, hop-by-hop hygiene
`cmd/gateway`	the runnable gateway

See GUIDE.md for the deep walkthrough.

gw-03 — Verification

One command

cd gw-03-api-gateway && bash scripts/verify.sh

What the tests prove

Test	Invariant
`TestPhaseOrderingAndShortCircuit`	inbound auth (order 10) short-circuits before routing/endpoint; the access log still fires on the 401
`TestExemptPathSkipsAuth`	exempt paths bypass auth and hit the local endpoint
`TestProxyForwardsToOrigin`	request is proxied; hop-by-hop `Connection` stripped; `X-Forwarded-For` added
`TestLongestPrefixAndHeaderPredicate`	longest-prefix + specificity routing; a header-gated canary route beats the base route of equal prefix
`TestPanicBecomes502`	a panicking filter yields 502 (not a dropped connection) and outbound still runs
`TestProxyTimeoutIs504`	a slow origin yields 504 at the per-request deadline
`TestProxyDeadOriginIs502`	a dead origin (connection refused) yields 502
`TestHotReloadNoDrops`	concurrent route-table swaps under load drop zero lookups (lock-free atomic swap)

All under -race — the hot-reload test specifically stresses the atomic-pointer swap against concurrent readers.

What "green" does NOT guarantee

No real event-loop semantics. Go's netpoller hides the loop; the "never block the loop" discipline (GUIDE §4) matters when you port this to Java/Netty.
No pooling/resilience/security yet. Those plug into the Transport and inbound seams (gw-04/06/07).
Body is buffered, not streamed. Streaming is an exercise (GUIDE §7).
Routing is static in the CLI. Dynamic control-plane push is gw-08.

gw-03 step 01 — The filter chain and request context

Goal

Build the Zuul-shaped core: a RequestContext threaded through three ordered phases (inbound → endpoint → outbound), with short-circuit and guaranteed outbound execution. Everything else in this lab plugs into this.

Code — `src/go/gateway.go`

package gw

import (
	"net/http"
	"sort"
	"time"
)

type Phase int

const (
	Inbound Phase = iota
	Endpoint
	Outbound
)

// RequestContext is the mutable state carried through the chain
// (Zuul's SessionContext).
type RequestContext struct {
	Req        *http.Request
	Resp       *ResponseBuilder
	RouteName  string                 // set by a routing filter (step 02)
	Attributes map[string]any         // cross-filter data (e.g. identity)
	Timings    map[string]time.Duration
	stopped    bool                   // short-circuit flag
	startedAt  time.Time
}

func (c *RequestContext) Stop()        { c.stopped = true }
func (c *RequestContext) Stopped() bool { return c.stopped }

// ResponseBuilder is the response being assembled.
type ResponseBuilder struct {
	Status int
	Header http.Header
	Body   []byte
}

// Filter is the unit of gateway logic (Zuul's ZuulFilter).
type Filter interface {
	Type() Phase
	Order() int                       // lower runs first within a phase
	ShouldFilter(c *RequestContext) bool
	Apply(c *RequestContext)
}

type Gateway struct {
	inbound  []Filter
	endpoint Filter // exactly one terminal action
	outbound []Filter
}

func (g *Gateway) Use(f Filter) {
	switch f.Type() {
	case Inbound:
		g.inbound = append(g.inbound, f)
		sort.SliceStable(g.inbound, func(i, j int) bool {
			return g.inbound[i].Order() < g.inbound[j].Order()
		})
	case Endpoint:
		g.endpoint = f
	case Outbound:
		g.outbound = append(g.outbound, f)
		sort.SliceStable(g.outbound, func(i, j int) bool {
			return g.outbound[i].Order() < g.outbound[j].Order()
		})
	}
}

// ServeHTTP runs the full lifecycle for one request.
func (g *Gateway) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	c := &RequestContext{
		Req:        r,
		Resp:       &ResponseBuilder{Status: 0, Header: http.Header{}},
		Attributes: map[string]any{},
		Timings:    map[string]time.Duration{},
		startedAt:  time.Now(),
	}

	// INBOUND: stop as soon as a filter short-circuits.
	for _, f := range g.inbound {
		if c.Stopped() {
			break
		}
		g.runOne(c, f)
	}

	// ENDPOINT: only if not already short-circuited (e.g. by auth/cache).
	if !c.Stopped() && g.endpoint != nil && g.endpoint.ShouldFilter(c) {
		g.runOne(c, g.endpoint)
	}

	// OUTBOUND: ALWAYS runs — access logs/metrics must fire even on 401/429.
	for _, f := range g.outbound {
		g.runOne(c, f)
	}

	g.write(w, c)
}

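func (g *Gateway) runOne(c *RequestContext, f Filter) {
	if !f.ShouldFilter(c) {
		return
	}
	start := time.Now()
	defer func() {
		// Recover here so a filter bug never crashes the connection.
		// A panic becomes a 502 and short-circuits the chain, so a later
		// inbound/endpoint filter cannot overwrite the error status.
		// Outbound filters still run because they ignore Stopped().
		if rec := recover(); rec != nil {
			c.Resp.Status = http.StatusBadGateway
			c.Resp.Body = []byte("filter error")
			c.Stop()
		}
		// name(f) is assumed to be a small helper (not shown) that
		// returns a label for the filter, e.g. its type name.
		c.Timings[name(f)] = time.Since(start)
	}()
	f.Apply(c)
}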

func (g *Gateway) write(w http.ResponseWriter, c *RequestContext) {
	if c.Resp.Status == 0 {
		c.Resp.Status = http.StatusOK
	}
	for k, vs := range c.Resp.Header {
		for _, v := range vs {
			w.Header().Add(k, v)
		}
	}
	w.WriteHeader(c.Resp.Status)
	w.Write(c.Resp.Body)
}

Code — three example filters

package gw

import (
	"fmt"
	"net/http"
)

// AuthFilter (inbound, early): short-circuits a missing/invalid token.
type AuthFilter struct{}

func (AuthFilter) Type() Phase  { return Inbound }
func (AuthFilter) Order() int   { return 10 }
func (AuthFilter) ShouldFilter(c *RequestContext) bool { return true }
func (AuthFilter) Apply(c *RequestContext) {
	tok := c.Req.Header.Get("Authorization")
	if tok == "" {
		c.Resp.Status = http.StatusUnauthorized
		c.Resp.Body = []byte("missing token")
		c.Stop() // no routing, no origin call for a 401
		return
	}
	c.Attributes["identity"] = "user-from:" + tok // gw-07 does this properly
}

// HealthEndpoint (endpoint): serve /healthz locally, no origin call.
type HealthEndpoint struct{}

func (HealthEndpoint) Type() Phase { return Endpoint }
func (HealthEndpoint) Order() int  { return 0 }
func (HealthEndpoint) ShouldFilter(c *RequestContext) bool {
	return c.Req.URL.Path == "/healthz"
}
func (HealthEndpoint) Apply(c *RequestContext) {
	c.Resp.Status = http.StatusOK
	c.Resp.Body = []byte("ok")
}

// AccessLog (outbound, last): always runs, even on short-circuit.
type AccessLog struct{}

func (AccessLog) Type() Phase  { return Outbound }
func (AccessLog) Order() int   { return 1000 }
func (AccessLog) ShouldFilter(c *RequestContext) bool { return true }
func (AccessLog) Apply(c *RequestContext) {
	fmt.Printf("%s %s -> %d route=%q timings=%v\n",
		c.Req.Method, c.Req.URL.Path, c.Resp.Status, c.RouteName, c.Timings)
}

Tasks

Implement Gateway and the three filters; wire with http.ListenAndServe(":8080", g).
Confirm phase ordering: curl :8080/healthz (no auth) returns 401 — because auth (inbound, order 10) runs before the endpoint. Add an exemption so /healthz skips auth, and confirm the access log fires on both the 401 and the 200.
Add a second inbound filter at order 50 that sets a header; prove it runs after auth and is skipped when auth short-circuits.

Acceptance

Inbound filters run in Order(); auth short-circuits before routing.
Outbound AccessLog fires on every request, including short-circuited ones — verify the 401 is logged.
A panicking filter yields a 502, not a dropped connection.
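The three acceptance criteria above can be checked against a stripped-down model of the chain before wiring up real HTTP. This is a minimal sketch, not the lab's gateway.go: the ctx, filter, chain, and demo names are hypothetical stand-ins, and each filter is just a named function so the order of execution can be recorded.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// ctx is a hypothetical, stripped-down stand-in for RequestContext.
type ctx struct {
	status  int
	stopped bool
	trace   []string // names of the filters that actually ran
}

type filter struct {
	name  string
	apply func(*ctx)
}

type chain struct {
	inbound  []filter
	endpoint filter
	outbound []filter
}

func (ch chain) run(c *ctx) {
	for _, f := range ch.inbound {
		if c.stopped {
			break
		}
		runOne(c, f)
	}
	if !c.stopped {
		runOne(c, ch.endpoint)
	}
	for _, f := range ch.outbound { // outbound ignores the stop flag
		runOne(c, f)
	}
}

// runOne converts a panic into a 502 and stops the chain.
func runOne(c *ctx, f filter) {
	defer func() {
		if rec := recover(); rec != nil {
			c.status = http.StatusBadGateway
			c.stopped = true
		}
	}()
	c.trace = append(c.trace, f.name)
	f.apply(c)
}

// demo runs one request and returns the status plus the filters that ran.
func demo(token string, buggy bool) (int, string) {
	ch := chain{
		inbound: []filter{
			{"auth", func(c *ctx) {
				if token == "" {
					c.status = http.StatusUnauthorized
					c.stopped = true
				}
			}},
			{"tag", func(c *ctx) {
				if buggy {
					panic("filter bug")
				}
			}},
		},
		endpoint: filter{"endpoint", func(c *ctx) { c.status = http.StatusOK }},
		outbound: []filter{{"log", func(c *ctx) {}}},
	}
	c := &ctx{}
	ch.run(c)
	return c.status, strings.Join(c.trace, ",")
}

func main() {
	fmt.Println(demo("", false))  // missing token
	fmt.Println(demo("t", false)) // happy path
	fmt.Println(demo("t", true))  // panicking inbound filter
}
```

The trace string makes the ordering visible: the "log" filter appears in every run, while "tag" and "endpoint" disappear once auth short-circuits.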

Discussion prompts

Why must outbound filters run even when an inbound filter short-circuits? (Metrics/logs/trace completion.)
Why is there exactly one endpoint filter but many inbound/outbound?
Where would you put a rate limiter, and why before routing? (Don't spend an origin connection on a request you'll 429.)

gw-03 step 02 — Routing table, hop-by-hop hygiene, and hot-reload

Goal

Add a data-driven routing table that maps requests to origin clusters, strip hop-by-hop headers correctly, and hot-reload the route table without dropping a request — the property that makes a gateway operable at fleet scale (and a preview of the control plane in gw-08).

Code — the route table

package gw

import (
	"strings"
	"sync/atomic"
)

type Route struct {
	Host    string // "" = any
	Prefix  string // path prefix
	Method  string // "" = any
	Header  [2]string // optional [name,value] predicate, e.g. canary steering
	Cluster string
}

type RouteTable struct {
	routes []Route
}

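// Match returns the cluster of the best route whose predicates all
// match. The longest prefix wins (so /v1/play/start beats /v1); on an
// equal prefix the route with more predicates wins, so a header-gated
// canary beats a base route that sets fewer predicates. Any remaining
// tie goes to table order.
func (t *RouteTable) Match(host, method, path string, hdr func(string) string) (string, bool) {
	if t == nil {
		return "", false // no table has been published yet
	}
	best, bestLen, bestSpec := -1, -1, -1
	for i, r := range t.routes {
		if r.Host != "" && r.Host != host {
			continue
		}
		if r.Method != "" && r.Method != method {
			continue
		}
		if !strings.HasPrefix(path, r.Prefix) {
			continue
		}
		if r.Header[0] != "" && hdr(r.Header[0]) != r.Header[1] {
			continue
		}
		spec := specificity(r)
		if len(r.Prefix) > bestLen || (len(r.Prefix) == bestLen && spec > bestSpec) {
			best, bestLen, bestSpec = i, len(r.Prefix), spec
		}
	}
	if best < 0 {
		return "", false
	}
	return t.routes[best].Cluster, true
}

// specificity counts the optional predicates (host, method, header)
// that a route sets; it is the tie-breaker for equal-length prefixes.
func specificity(r Route) int {
	n := 0
	if r.Host != "" {
		n++
	}
	if r.Method != "" {
		n++
	}
	if r.Header[0] != "" {
		n++
	}
	return n
}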

// Router holds an atomically-swappable table for lock-free hot-reload.
type Router struct {
	tbl atomic.Pointer[RouteTable]
}

func (rt *Router) Load() *RouteTable { return rt.tbl.Load() }

// Swap atomically replaces the table. In-flight requests keep using the
// snapshot they read; new requests see the new table. No dropped request.
func (rt *Router) Swap(t *RouteTable) { rt.tbl.Store(t) }

Code — the routing filter (inbound)

type RoutingFilter struct{ R *Router }

func (RoutingFilter) Type() Phase { return Inbound }
func (RoutingFilter) Order() int  { return 50 } // after auth(10), before endpoint
func (RoutingFilter) ShouldFilter(c *RequestContext) bool { return true }

func (f RoutingFilter) Apply(c *RequestContext) {
	cluster, ok := f.R.Load().Match(
		c.Req.Host, c.Req.Method, c.Req.URL.Path, c.Req.Header.Get)
	if !ok {
		c.Resp.Status = 404
		c.Resp.Body = []byte("no route")
		c.Stop()
		return
	}
	c.RouteName = cluster
	c.Attributes["cluster"] = cluster
}

Code — hop-by-hop header hygiene

A proxy must consume, not forward, hop-by-hop headers (RFC 9110 §7.6.1) and must normalize forwarding headers:

var hopByHop = []string{
	"Connection", "Proxy-Connection", "Keep-Alive", "Proxy-Authenticate",
	"Proxy-Authorization", "Te", "Trailer", "Transfer-Encoding", "Upgrade",
}

func sanitizeForOrigin(c *RequestContext) {
	h := c.Req.Header
	// Also remove anything named by the Connection header. Values covers
	// repeated Connection lines; Get would only return the first one.
	for _, line := range h.Values("Connection") {
		for _, name := range strings.Split(line, ",") {
			if n := strings.TrimSpace(name); n != "" {
				h.Del(n)
			}
		}
	}
	for _, n := range hopByHop {
		h.Del(n)
	}
	// Append the client IP to X-Forwarded-For. This keeps any inbound
	// chain, which is only safe behind a trusted hop; for untrusted
	// clients drop the inbound value first (gw-07 trust-boundary
	// discussion). clientIP is a helper (not shown) that extracts the
	// peer address from the request.
	if ip := clientIP(c.Req); ip != "" {
		if prior := h.Values("X-Forwarded-For"); len(prior) > 0 {
			ip = strings.Join(prior, ", ") + ", " + ip
		}
		h.Set("X-Forwarded-For", ip)
	}
}

Hot-reload demo

// Build an initial table.
r := &Router{}
r.Swap(&RouteTable{routes: []Route{
	{Host: "api.local", Prefix: "/v1/play", Cluster: "playback"},
	{Host: "api.local", Prefix: "/v1/search", Cluster: "search"},
	{Prefix: "/", Cluster: "edge-fallback"},
}})

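// Later, e.g. on a control-plane push or SIGHUP, publish a new table
// that adds a canary route, with ZERO dropped requests. Build a fresh
// slice: appending to the live table's slice could write into memory
// that in-flight readers are still scanning.
old := r.Load().routes
next := make([]Route, 0, len(old)+1)
// The canary is listed first and names the same host as the base route,
// so it wins the equal-prefix tie for requests carrying x-canary: true.
next = append(next, Route{Host: "api.local", Prefix: "/v1/play",
	Header: [2]string{"x-canary", "true"}, Cluster: "playback-canary"})
next = append(next, old...)
r.Swap(&RouteTable{routes: next})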

Tasks

Implement the route table + routing filter; verify /v1/play/start → playback, /v1/search/q → search, /anything → edge-fallback, longest-prefix wins.
Add a header-predicate canary route (x-canary: true → playback-canary) and confirm steering works (gw-12 uses exactly this for migrations).
Hot-reload: run wrk continuously against the gateway, call Swap on a ticker to add/remove a route, and show zero errors during the swaps (atomic pointer = no lock, no dropped request).
Verify hop-by-hop headers are stripped and X-Forwarded-For is appended (inspect what the origin receives).

Acceptance

Correct longest-prefix + predicate matching.
Continuous wrk load through repeated Swaps with no 5xx/dropped requests.
Origin sees a clean header set with a correct X-Forwarded-For.
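The "no dropped request" criterion can be exercised without wrk by hammering an atomically swapped table from several goroutines. The following is a minimal sketch under stated assumptions: table, router, and stress are hypothetical names, the table is a plain map rather than the lab's RouteTable, and each published table is treated as immutable.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// table is immutable once published; writers always build a new one.
type table struct{ routes map[string]string }

type router struct{ tbl atomic.Pointer[table] }

// lookup takes one snapshot and uses only that snapshot.
func (r *router) lookup(prefix string) (string, bool) {
	t := r.tbl.Load()
	cluster, ok := t.routes[prefix]
	return cluster, ok
}

// stress runs concurrent readers while the writer swaps tables and
// returns how many lookups failed to find the always-present route.
func stress(readers, lookupsPerReader, swaps int) int64 {
	r := &router{}
	r.tbl.Store(&table{routes: map[string]string{"/v1/play": "playback"}})

	var misses atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < readers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < lookupsPerReader; j++ {
				if _, ok := r.lookup("/v1/play"); !ok {
					misses.Add(1)
				}
			}
		}()
	}
	for s := 0; s < swaps; s++ {
		next := map[string]string{"/v1/play": "playback"}
		if s%2 == 0 { // alternately add and remove a canary route
			next["/v1/canary"] = "playback-canary"
		}
		r.tbl.Store(&table{routes: next})
	}
	wg.Wait()
	return misses.Load()
}

func main() {
	fmt.Println("missed lookups:", stress(8, 10000, 1000))
}
```

Running it under go run -race is the interesting part: the readers never lock, and the race detector stays quiet because a published map is never mutated.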

Discussion prompts

Why an atomic.Pointer swap instead of locking the table on every request? (Read path is hot — 1M+ rps; writes are rare.)
How does this hot-reload generalize to a fleet of gateways receiving a control-plane push (gw-08)? What new failure modes appear when the push is partial across the fleet?
Why must you not trust an inbound X-Forwarded-For from an untrusted client, and how does that interact with the PROXY protocol (gw-01)?

gw-03 step 03 — The async proxy endpoint, timeouts, and error handling

Goal

Implement the endpoint filter that actually proxies to the origin — without blocking the event loop — with a per-request timeout, clean error handling, and the seam where gw-04 (pooling) and gw-06 (resilience) will plug in.

Code — the proxy endpoint filter

In Go, the netpoller makes a goroutine "block" cheaply (no OS thread is parked), so the idiomatic version uses context deadlines. The comments mark where the Java/Netty version would use a CompletableFuture.

package gw

import (
	"context"
	"errors"
	"io"
	"net"
	"net/http"
	"time"
)

// ProxyEndpoint forwards the request to the cluster chosen by routing.
// The Transport here is where gw-04's pooled, subsetted connection
// manager and gw-06's resilience policy get injected.
type ProxyEndpoint struct {
	Resolve   func(cluster string) (origin string, ok bool) // service discovery (gw-08)
	Transport http.RoundTripper                              // pooled in gw-04
	Timeout   time.Duration
}

func (ProxyEndpoint) Type() Phase { return Endpoint }
func (ProxyEndpoint) Order() int  { return 0 }
func (p ProxyEndpoint) ShouldFilter(c *RequestContext) bool {
	return c.RouteName != "" && c.RouteName != "edge-fallback"
}

func (p ProxyEndpoint) Apply(c *RequestContext) {
	origin, ok := p.Resolve(c.RouteName)
	if !ok {
		c.Resp.Status = http.StatusServiceUnavailable
		c.Resp.Body = []byte("no healthy origin")
		return
	}

	// Per-request deadline: never let a slow origin pin a request
	// forever. This is the single most important resilience primitive.
	ctx, cancel := context.WithTimeout(c.Req.Context(), p.Timeout)
	defer cancel()

	// Sanitize BEFORE cloning: Clone deep-copies the headers, so cleaning
	// c.Req after the clone would leave the outgoing request untouched.
	sanitizeForOrigin(c) // strip hop-by-hop, set X-Forwarded-For (step 02)

	out := c.Req.Clone(ctx)
	out.URL.Scheme = "http"
	out.URL.Host = origin
	out.Host = origin
	out.RequestURI = "" // must be empty on client-side requests

	// RoundTrip "blocks" this goroutine but NOT an OS thread: Go's
	// netpoller is an epoll/kqueue loop. In Netty you'd return a Future
	// and resume the outbound chain in its callback; same semantics.
	resp, err := p.Transport.RoundTrip(out)
	if err != nil {
		p.fail(c, err)
		return
	}
	defer resp.Body.Close()

	// For a lab we buffer; production streams the body with backpressure
	// (gw-01) to bound memory and preserve latency. Bodies larger than
	// 8 MiB are silently truncated by the LimitReader.
	body, err := io.ReadAll(io.LimitReader(resp.Body, 8<<20))
	if err != nil {
		p.fail(c, err) // e.g. the deadline fired mid-body
		return
	}

	c.Resp.Status = resp.StatusCode
	for k, vs := range resp.Header {
		for _, v := range vs {
			c.Resp.Header.Add(k, v)
		}
	}
	c.Resp.Body = body
}

// fail records an origin error on the context.
func (p ProxyEndpoint) fail(c *RequestContext, err error) {
	c.Resp.Status = statusForError(err) // 504 on timeout, 502 otherwise
	c.Resp.Body = []byte("origin error: " + err.Error())
	c.Attributes["origin_error"] = err
}

func statusForError(err error) int {
	var ne net.Error
	if errors.Is(err, context.DeadlineExceeded) || (errors.As(err, &ne) && ne.Timeout()) {
		return http.StatusGatewayTimeout // 504
	}
	return http.StatusBadGateway // 502
}

Code — the error filter (runs as the first outbound filter)

// ErrorNormalizer ensures every response is well-formed even after a
// filter panic or an unset status, and records the outcome class for
// metrics (gw-11).
type ErrorNormalizer struct{}

func (ErrorNormalizer) Type() Phase  { return Outbound }
func (ErrorNormalizer) Order() int   { return 1 } // earliest outbound
func (ErrorNormalizer) ShouldFilter(c *RequestContext) bool { return true }
func (ErrorNormalizer) Apply(c *RequestContext) {
	if c.Resp.Status == 0 {
		c.Resp.Status = http.StatusInternalServerError
	}
	class := c.Resp.Status / 100
	c.Attributes["status_class"] = class // 2/3/4/5 → RED metrics in gw-11
}

Tasks

Implement ProxyEndpoint with a default http.Transport (gw-04 replaces it with a pooled, subsetted one). Stand up a slow origin (sleep handler) and confirm a 200ms Timeout yields a 504, not a hang.
Kill the origin mid-flight; confirm a 502 and that the access log (step 01) still fires with the error recorded.
Load-test with wrk2 -R5000 through the full chain (auth → route → proxy → log). Record p50/p99. Then make 10% of origin responses slow and show how a per-request timeout bounds the tail (the motivation for gw-06 hedging/retries).

Acceptance

Slow origin → 504 at the configured deadline; dead origin → 502; both logged with status class for metrics.
A filter panic anywhere still produces a clean 5xx and an access log (no dropped connections).
You can point to the Transport and Timeout seams and say exactly what gw-04 and gw-06 will inject there.
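The first acceptance line (slow origin gives 504, dead origin gives 502) can be reproduced in isolation with net/http/httptest. This is a hedged sketch rather than the lab's test: statusFor, probe, slowOrigin, and demo are hypothetical names, the delays are arbitrary, and http.DefaultTransport stands in for whatever Transport the gateway injects.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// statusFor maps a transport error to the gateway status code.
func statusFor(err error) int {
	var ne net.Error
	if errors.Is(err, context.DeadlineExceeded) || (errors.As(err, &ne) && ne.Timeout()) {
		return http.StatusGatewayTimeout
	}
	return http.StatusBadGateway
}

// probe sends one GET with a per-request deadline and returns the
// status the gateway would report.
func probe(url string, timeoutMs int) int {
	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return http.StatusBadGateway
	}
	resp, err := http.DefaultTransport.RoundTrip(req)
	if err != nil {
		return statusFor(err)
	}
	defer resp.Body.Close()
	return resp.StatusCode
}

// slowOrigin answers "ok" after delayMs, or gives up if the client leaves.
func slowOrigin(delayMs int) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(time.Duration(delayMs) * time.Millisecond):
		case <-r.Context().Done():
			return
		}
		w.Write([]byte("ok"))
	}))
}

// demo probes a slow, a dead, and a fast origin.
func demo() (slow, dead, fast int) {
	s := slowOrigin(500)
	defer s.Close()
	slow = probe(s.URL, 50) // deadline fires long before the origin answers

	d := slowOrigin(0)
	deadURL := d.URL
	d.Close() // nothing listens here any more
	dead = probe(deadURL, 1000)

	f := slowOrigin(0)
	defer f.Close()
	fast = probe(f.URL, 1000)
	return slow, dead, fast
}

func main() {
	slow, dead, fast := demo()
	fmt.Println("slow:", slow, "dead:", dead, "fast:", fast)
}
```

Checking errors.Is(err, context.DeadlineExceeded) as well as net.Error keeps the classification robust if a custom Transport wraps the context error.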

Discussion prompts

Why is a per-request timeout the first resilience primitive, before retries or circuit breakers? (Without it, everything else is built on unbounded latency.)
In Java/Netty, what specifically must ProxyEndpoint avoid doing on the event-loop thread, and how would you offload it if a filter truly needed blocking work?
This step buffers the response body. When must you stream it instead, and what does streaming cost you in observability and transformation ability?

gw-04 — Connection Management: Curbing Connection Churn

This lab is built directly on the Netflix talk named in the JD — "Curbing Connection Churn in Zuul." It's the most Netflix-specific topic in the phase and a perfect interview centerpiece because it ties together everything from gw-01 (sockets, TLS handshake cost) through gw-03 (the endpoint phase) into one concrete, measured win: Zuul went from opening thousands of origin connections per second to about 60, an ≈8× reduction in TCP opens, by fixing how the gateway manages connections to its backends.

The two ideas that did it: per-event-loop connection pools (so a request and its origin connection live on the same thread, eliminating cross-thread handoff and contention) and subsetting (so a large gateway fleet doesn't fan out a connection to every origin instance — it talks to a balanced subset). The subset is chosen with a low-discrepancy sequence (a binary Van der Corput sequence) so the load lands evenly even as the fleet and the origin set change.

You will build a connection pool, add per-event-loop partitioning, implement deterministic subsetting, and measure the churn drop with a load generator — reproducing the shape of the Netflix result.

1. What is it?

Connection churn is the rate at which you open and close TCP (and, worse, TLS) connections to your backends. Every new connection costs a TCP handshake (1 RTT) plus, for TLS, an expensive asymmetric-crypto handshake (1–2 RTT + CPU). A gateway that opens a fresh origin connection per request pays that tax constantly: latency on the request path and CPU burned on handshakes, plus pressure on the origin's accept queue (gw-01).

Connection management is the set of techniques that drive churn toward zero:

Keep-alive / connection reuse — don't close the origin connection after a response; reuse it for the next request to that origin.
Connection pooling — keep a pool of warm, reusable connections per origin; check one out per request, return it after.
Per-event-loop pools — partition the pool by event-loop thread so a request never has to acquire a connection owned by another thread (no lock contention, no cross-thread handoff, better cache locality).
Subsetting — each gateway instance connects to a subset of origin instances, not all of them, so total connections = gateways × subset_size instead of gateways × origins, while load stays balanced.
HTTP/2 origin multiplexing — one h2 connection carries many concurrent requests, so a handful of connections per origin suffices.

2. Why does it matter?

It's a top-line efficiency metric at Netflix. The talk's headline: ~8× fewer TCP opens, churn from thousands/s to ~60/s. At 1M+ rps across 80+ Zuul clusters, that's an enormous reduction in handshake CPU and tail latency. This is the kind of "tangible impact on the backbone" the JD is hiring for.
Churn is a hidden CPU sink. New engineers see high gateway CPU and look at filters; veterans look at connections.created.rate and the TLS handshake count. Recognizing churn as a CPU problem (not just a latency one) is a senior signal.
It's a fleet-scale combinatorics problem. Without subsetting, a fleet of N gateways × M origins = N×M connections; both grow, so the product explodes. A 500-instance gateway fleet to a 1000-instance origin = 500,000 connections, most idle, each a memory and keepalive cost on the origin. Subsetting turns N×M into N×subset. This is exactly the math the interviewer wants you to do out loud.
Balance is the hard part. Naive subsetting (random, or hash mod) creates hot origins and cold origins, especially as instances come and go. The low-discrepancy (Van der Corput) approach keeps the subset assignment evenly spread and stable under churn of the membership itself — change one gateway or origin and only a small, balanced part of the assignment moves.
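The connection arithmetic in the list above is easy to check in code. A minimal sketch, with hypothetical function names, that also computes the ideal number of gateways covering each origin when the subsets are perfectly balanced:

```go
package main

import "fmt"

// fullMesh is the everyone-to-everyone count: N x M.
func fullMesh(gateways, origins int) int { return gateways * origins }

// withSubset is the subsetted count: N x subset.
func withSubset(gateways, subsetSize int) int { return gateways * subsetSize }

// idealCoverage is how many gateways should include each origin if the
// subsets are spread perfectly evenly.
func idealCoverage(gateways, origins, subsetSize int) float64 {
	return float64(gateways*subsetSize) / float64(origins)
}

func main() {
	full := fullMesh(500, 1000)
	sub := withSubset(500, 20)
	fmt.Printf("everyone-to-everyone: %d connections\n", full)
	fmt.Printf("subset (size 20):     %d connections (%.1fx fewer)\n",
		sub, float64(full)/float64(sub))
	fmt.Printf("ideal coverage per origin: %.2f\n", idealCoverage(500, 1000, 20))
}
```

The coverage figure is the balance target: with 500 gateways, 1000 origins, and a subset of 20, each origin should appear in about 10 subsets.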

3. How does it work?

The churn problem, drawn

NO POOLING (churn): every request handshakes a new origin connection
  req1 ─ TCP+TLS handshake ─▶ origin ─ response ─ CLOSE
  req2 ─ TCP+TLS handshake ─▶ origin ─ response ─ CLOSE      ← pay every time
  req3 ─ TCP+TLS handshake ─▶ origin ─ response ─ CLOSE

POOLING (no churn): handshake once, reuse
  req1 ─ checkout ─▶ [warm conn] ─▶ origin ─ response ─ return
  req2 ─ checkout ─▶ [warm conn] ─▶ origin ─ response ─ return   ← no handshake
  req3 ─ checkout ─▶ [warm conn] ─▶ origin ─ response ─ return

Per-event-loop pools (the Zuul insight)

A gateway runs K event-loop threads (gw-01/gw-03). If there's one shared pool, every checkout/return contends on a lock, and an origin connection owned by loop A might be handed to a request running on loop B: a cross-thread handoff that hurts cache locality and ordering. The fix: one pool per event loop. A request running on loop A only ever uses connections owned by loop A. No lock, no handoff, the whole request/response cycle stays on one thread.

        ┌── event loop 0 ──┐   pool0: [conn→originX][conn→originY]
 reqs ─▶│   request runs    │── checkout/return within the loop, lock-free
        └───────────────────┘
        ┌── event loop 1 ──┐   pool1: [conn→originX][conn→originZ]
 reqs ─▶│   request runs    │
        └───────────────────┘

The trade-off (be ready to state it): per-loop pools mean each loop needs its own warm connections, so the minimum connection count rises with loop count. Subsetting is what keeps that in check.

Subsetting — `N×M` → `N×subset`

Each gateway picks a fixed-size subset of the origin instances to talk to. The requirements:

Balanced load on origins. Every origin should be in roughly the same number of gateways' subsets (so it gets ~equal traffic).
Stable under membership churn. When one gateway or one origin is added/removed, only a small, balanced portion of subset assignments should change (minimize connection re-establishment — otherwise re-subsetting causes churn).
No coordination. Each gateway computes its subset locally from the membership list (which the control plane provides, gw-08).

Low-discrepancy subsetting (Van der Corput)

The Netflix approach builds a balanced distribution ring using a low-discrepancy sequence. A low-discrepancy sequence fills the [0,1) interval as evenly as possible for any prefix length — unlike random points, which clump. The binary Van der Corput sequence is the canonical one: take the integer i, write it in binary, reverse the bits, and read it back as a fraction.

i   binary   reversed   value
0   0        .0         0.0
1   1        .1         0.5
2   10       .01        0.25
3   11       .11        0.75
4   100      .001       0.125
5   101      .101       0.625
6   110      .011       0.375
7   111      .111       0.875

Notice the values never clump: each new point lands in the largest remaining gap. Map gateways and origins onto this ring and each gateway deterministically picks the subset_size origins nearest its position; because the sequence is balanced, every origin is covered ~equally, and adding/removing one member shifts only a small balanced slice.
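The bit-reversal rule in the table translates almost directly into code. A minimal sketch, assuming a 32-bit index and using math/bits.Reverse32; the function name mirrors the VanDerCorput mentioned in the lab guide, but the lab's actual signature may differ.

```go
package main

import (
	"fmt"
	"math/bits"
)

// VanDerCorput returns the i-th point of the binary Van der Corput
// sequence: reverse the 32 bits of i, then read the result as a binary
// fraction in [0,1) by dividing by 2^32.
func VanDerCorput(i uint32) float64 {
	return float64(bits.Reverse32(i)) / (1 << 32)
}

func main() {
	for i := uint32(0); i < 8; i++ {
		fmt.Printf("%d -> %.3f\n", i, VanDerCorput(i))
	}
}
```

Because every value is a dyadic fraction, the results are exact in float64, which is what makes pinning the sequence with equality assertions safe.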

HTTP/2 origin multiplexing

If the origin speaks h2, one connection multiplexes hundreds of concurrent requests (gw-02), so you need only a few connections per origin per loop. This is the other big churn lever: h1 needs pool_size ≈ peak concurrency; h2 needs a handful. The trade-off is the h2 concentration/HOL issue from gw-02.

Connection lifecycle and health

A pooled connection isn't free to keep forever:

Idle eviction — close connections idle past a TTL so you don't hold thousands of cold sockets (and so a silently-dead origin gets cleaned up). But evict too aggressively and you re-create churn.
Max lifetime — recycle connections after a max age so DNS/endpoint changes and rebalancing eventually take effect (the gRPC MAX_CONNECTION_AGE tension from gw-02).
Health/validation — validate a connection on checkout (or rely on keepalive + TCP_USER_TIMEOUT, gw-01) so you don't hand out a half-dead connection and turn one origin failure into a latency cliff.

4. Core terminology

Term	Definition
Connection churn	The rate of opening/closing connections to backends; the thing to minimize.
Keep-alive	Reusing a connection for sequential requests instead of closing it.
Connection pool	A managed set of warm, reusable connections per origin.
Per-event-loop pool	A pool partitioned by event-loop thread; lock-free, no cross-thread handoff.
Subsetting	Each gateway connects to a subset of origins so total connections = `gateways × subset` not `gateways × origins`.
Low-discrepancy sequence	A sequence that fills an interval evenly for every prefix length (vs random clumping).
Van der Corput sequence	The canonical binary low-discrepancy sequence: reverse the bits of `i`.
Distribution ring	Members mapped onto `[0,1)`; each picks nearby members for its subset.
Idle eviction / max lifetime	TTLs that bound how long a pooled connection lives.
h2 origin multiplexing	Using one HTTP/2 connection per origin for many concurrent requests.
`connections.created.rate`	The metric that is connection churn; the thing you watch.

5. Mental models

Pooling is reusing a taxi; churn is calling a new taxi for every block. The handshake is the taxi pulling up and you buckling in. Keep the same taxi (keep-alive) and the per-trip overhead vanishes. TLS makes the "buckling in" a full safety briefing — expensive enough that reuse is a CPU win, not just latency.
Subsetting is potluck seating. 500 guests (gateways) and 1000 dishes (origins): nobody can sample every dish. Assign each guest a balanced subset so every dish gets eaten by about the same number of guests. Random seating leaves some dishes mobbed and others untouched; the Van der Corput seating chart spreads everyone evenly and barely reshuffles when one guest or dish is added.
Low-discrepancy = "always fill the biggest gap next." Random darts clump and leave holes; the Van der Corput sequence places each dart in the largest empty space. That's exactly the property you want when assigning a changing set of gateways to a changing set of origins without coordination.
A pool is a cache of expensive objects. All the cache problems apply: sizing (too small → churn, too big → idle waste), eviction (too eager → churn, too lazy → stale/dead connections), and the thundering herd when the cache is cold (a deploy empties every pool at once → a churn spike exactly when you can least afford it).

6. Common misconceptions

"Just make the pool huge." A huge pool holds thousands of idle connections — memory on both ends, keepalive traffic, and at fleet scale it's the N×M explosion subsetting exists to prevent. Size to peak per-loop concurrency, then subset.
"Subsetting hurts balance." Naive subsetting does. The whole point of the low-discrepancy approach is balanced coverage and stability under membership change. Done right, subsetting improves balance versus everyone-talks-to-everyone with random LB.
"HTTP/2 to the origin eliminates the need for pooling." It reduces connection count, but you still pool the (few) h2 connections, still manage their lifecycle, and now you have the concentration/HOL trade-off and the MAX_CONNECTION_AGE-vs-churn tension to manage.
"Idle eviction is purely good." Evicting idle connections fights churn reduction: evict too soon and you re-handshake on the next request. The right answer is a TTL tuned to traffic shape plus keepalive to detect dead peers — not aggressive eviction.
"Re-subsetting is cheap." Recomputing subsets on every membership change can cause a churn storm (drop and re-establish many connections at once). Stability-under-change is a first-class requirement, which is why the sequence choice matters.

7. Interview talking points

"How would you reduce connection churn at a gateway?" This is a near-certain question given the JD. Answer in layers: (1) keep-alive + pooling to stop per-request handshakes; (2) per-event-loop pools to kill lock contention and cross-thread handoff; (3) subsetting to stop the N×M fan-out; (4) low-discrepancy subset selection for balance + stability; (5) h2 origin multiplexing where applicable; (6) careful idle/lifetime TTLs. Quote the result shape: thousands/s → ~60/s, ≈8×.
"Do the connection math." 500 gateways, 1000 origins, everyone-to-everyone = 500k connections. With a subset of 20: 500×20 = 10k. State both numbers and the balance requirement.
"What's a low-discrepancy sequence and why use it here?" It fills the interval evenly for every prefix, so subset assignment is balanced and changes minimally when membership changes. The binary Van der Corput sequence is "reverse the bits of i." Contrast with random (clumps → hot/cold origins) and hash-mod (rebalances everything when the modulus/membership changes).
"Why per-event-loop pools specifically?" A request handled on loop A acquiring a connection from a shared pool means lock contention and possibly a connection owned by loop B (cross-thread handoff, cache misses, ordering hazards). Per-loop pools keep the entire request/response on one thread — the same "never leave the event loop" discipline from gw-03.
"What's the cost/downside of your scheme?" Per-loop pools raise the minimum connection count (each loop warms its own); subsetting trades a little resilience headroom (fewer origins per gateway) for efficiency, so the subset must be large enough to survive losing a few origins; idle TTLs trade memory for churn. Senior answers name the trade-offs unprompted.
"A deploy just caused a connection-churn spike. Why?" Cold pools: every restarted gateway re-establishes its connections at once (thundering herd). Mitigations: stagger the deploy, pre-warm pools, jitter reconnects, ramp traffic. Ties to gw-12.

8. Connections to other labs

gw-01 (L4) is where the handshake cost and TIME_WAIT live; this lab is the optimization layer over those sockets.
gw-02 (L7) — h2 origin multiplexing is a churn lever, with the concentration/HOL trade-off.
gw-03 (API gateway) — the pool is the Transport behind the endpoint filter; per-event-loop pools mirror the event-loop model.
gw-06 (resilience) — subsetting interacts with load balancing (P2C over the subset) and outlier ejection (eject within the subset).
gw-08 (Envoy/xDS) — the control plane supplies the membership list (EDS) that subsetting is computed from; Envoy has built-in subset LB.
gw-09 (Kubernetes networking) — EndpointSlices are the membership source in K8s; pod churn drives the stability-under-change requirement.

gw-04 — The Hitchhiker's Guide to Curbing Connection Churn

Companion to CONCEPTS.md, with the runnable code in src/go/connpool/. This lab reproduces the headline result of Netflix's "Curbing Connection Churn in Zuul" — in code you can run, measure, and break.

Connection churn is the rate at which you open/close connections to your backends. Every new connection costs a TCP handshake plus (for TLS) an expensive asymmetric-crypto handshake. At 1M+ rps a gateway that opens a fresh origin connection per request burns enormous CPU on handshakes and floods origins' accept queues (gw-01). The fix is three techniques, each a file in this package: pooling, per-event-loop pools, and subsetting.

Run bash scripts/verify.sh and you'll see the real numbers:

connection math:
  everyone-to-everyone : 500000 connections
  subset (size 20)      : 10000 connections  (50.0x fewer)
coverage per origin (ideal 10.00):
  min=5  p50=10  max=15
stability when one origin leaves (lower is better):
  Van der Corput ring : 11 / 500 gateways changed
  hash-mod subsetting : 259 / 500 gateways changed

Those three blocks are the lab. Let's earn each one.

1. Pooling: stop re-handshaking (pool.go)

Pool.Get(origin) returns a warm connection from the idle set if one exists (incrementing Reused), else dials a new one (incrementing Created — the churn counter). Put returns it for reuse, or closes it if the idle set is full.

The tests quantify the win:

TestNoPoolIsAllChurn: with maxIdle=0, 100 get/put cycles produce Created == 100 — pure churn, a handshake every request.
TestPoolingReducesChurn: with maxIdle=8, the same 100 cycles produce Created == 1, Reused == 99 — one warm connection, reused.

That's the entire pooling thesis in two tests: churn collapses from "every request" to "≈zero."
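The Get/Put behaviour described above fits in a few lines. This is a simplified sketch, not the lab's pool.go: conn is a hypothetical stand-in for a real connection, there is no locking (it assumes a single goroutine, as a per-loop pool would), and churn is a helper added only to replay the two test scenarios.

```go
package main

import "fmt"

// conn stands in for a real net.Conn so the sketch needs no network.
type conn struct{ origin string }

type Pool struct {
	maxIdle int
	idle    map[string][]*conn // idle connections per origin
	Created int                // dials: this is the churn counter
	Reused  int
}

func NewPool(maxIdle int) *Pool {
	return &Pool{maxIdle: maxIdle, idle: map[string][]*conn{}}
}

// Get returns a warm connection if one is idle, else "dials" a new one.
func (p *Pool) Get(origin string) *conn {
	if s := p.idle[origin]; len(s) > 0 {
		c := s[len(s)-1]
		p.idle[origin] = s[:len(s)-1]
		p.Reused++
		return c
	}
	p.Created++
	return &conn{origin: origin}
}

// Put returns a connection for reuse, or drops it if the idle set is
// full (a real pool would close the socket here).
func (p *Pool) Put(c *conn) {
	if len(p.idle[c.origin]) >= p.maxIdle {
		return
	}
	p.idle[c.origin] = append(p.idle[c.origin], c)
}

// churn runs sequential get/put cycles and reports the counters.
func churn(maxIdle, cycles int) (created, reused int) {
	p := NewPool(maxIdle)
	for i := 0; i < cycles; i++ {
		p.Put(p.Get("origin-a"))
	}
	return p.Created, p.Reused
}

func main() {
	c, r := churn(0, 100)
	fmt.Println("maxIdle=0:", "created", c, "reused", r)
	c, r = churn(8, 100)
	fmt.Println("maxIdle=8:", "created", c, "reused", r)
}
```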

The eviction trap (over-tuning re-creates churn)

TestIdleEvictionCausesChurn uses an injectable clock (p.nowFn) to prove the failure mode: with a too-short idleTTL, a connection put back and retrieved after the TTL is evicted and re-dialed — Created climbs again. This is the bug juniors introduce when they "clean up idle connections aggressively": they reintroduce the very churn pooling removed. The right idleTTL is tuned to the traffic's inter-request gap, backed by keepalive (gw-01) to detect genuinely dead peers — not set to "a few hundred ms to be safe."
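The eviction trap can be simulated with a fake clock instead of real sleeps. The sketch below is an illustration under assumptions: TTLPool and simulate are hypothetical names, the pool tracks only timestamps rather than sockets, and nowFn plays the role of the injectable clock the test describes.

```go
package main

import (
	"fmt"
	"time"
)

// TTLPool tracks when each idle connection was returned.
type TTLPool struct {
	idleTTL time.Duration
	nowFn   func() time.Time // injectable clock
	idle    []time.Time      // return time of each idle connection
	Created int
	Reused  int
	Evicted int
}

// Get reuses the newest idle connection if it is still within the TTL;
// anything older is evicted and a new connection is dialed.
func (p *TTLPool) Get() {
	now := p.nowFn()
	for len(p.idle) > 0 {
		returnedAt := p.idle[len(p.idle)-1]
		p.idle = p.idle[:len(p.idle)-1]
		if now.Sub(returnedAt) <= p.idleTTL {
			p.Reused++
			return
		}
		p.Evicted++
	}
	p.Created++
}

// Put records the connection as idle as of now.
func (p *TTLPool) Put() { p.idle = append(p.idle, p.nowFn()) }

// simulate sends n requests spaced gapMs apart through a pool whose
// idle TTL is ttlMs, advancing a fake clock between requests.
func simulate(ttlMs, gapMs, n int) (created, reused int) {
	clock := time.Unix(0, 0)
	p := &TTLPool{
		idleTTL: time.Duration(ttlMs) * time.Millisecond,
		nowFn:   func() time.Time { return clock },
	}
	for i := 0; i < n; i++ {
		p.Get()
		p.Put()
		clock = clock.Add(time.Duration(gapMs) * time.Millisecond)
	}
	return p.Created, p.Reused
}

func main() {
	c, r := simulate(200, 1000, 10)
	fmt.Println("ttl 200ms, gap 1s:", "created", c, "reused", r)
	c, r = simulate(5000, 1000, 10)
	fmt.Println("ttl 5s, gap 1s:   ", "created", c, "reused", r)
}
```

With the TTL shorter than the inter-request gap every request re-dials; once the TTL exceeds the gap, the same traffic needs a single connection.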

2. Per-event-loop pools (looppools.go)

A single shared pool means every checkout/return contends on one lock, and a connection accepted on loop A might be handed to a request running on loop B — a cross-thread handoff that hurts cache locality and ordering. LoopPools gives each event loop its own pool; a request on loop i only ever touches pool[i]. Lock-free, no handoff — the specific change the Netflix talk highlights.

TestPerLoopIndependence shows the trade-off honestly: loop 0 and loop 1 each dial their own connection to the same origin (created == 2), then loop 0 reuses its own (reused == 1). So per-loop pools raise the minimum connection count ~K×. That's not a regression — it's why subsetting exists, and the two are designed to be used together. The contention you remove (at 1M+ rps, one lock is a real bottleneck) is worth far more than a modest rise in idle connections, and subsetting claws that rise back.

3. Subsetting (ring.go) — the `N×M` → `N×subset` fix

Without subsetting, N gateways × M origins = N×M connections. TestSubsetConnectionMath does the arithmetic: 500 gateways × 1000 origins = 500,000 connections; with a subset of 20, 10,000 (a 50× reduction). Both N and M grow with the fleet, so the product is the thing that explodes; subsetting replaces M with a fixed subset size, so the total grows linearly with the gateway count instead of with N×M.
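The arithmetic is small enough to check by hand. This sketch recomputes the demo's first block and the ideal per-origin coverage (gateways × subset / origins); connectionMath is a name invented for this example, not a function from the lab.

```go
package main

import "fmt"

// connectionMath returns the full-mesh and subsetted connection totals for
// a fleet, plus the ideal number of gateways covering each origin.
func connectionMath(gateways, origins, subset int) (fullMesh, subsetted int, idealCoverage float64) {
	if subset > origins {
		subset = origins // a subset cannot exceed the origin count
	}
	fullMesh = gateways * origins
	subsetted = gateways * subset
	idealCoverage = float64(subsetted) / float64(origins)
	return
}

func main() {
	full, sub, ideal := connectionMath(500, 1000, 20)
	fmt.Printf("everyone-to-everyone: %d\n", full)
	fmt.Printf("subset (size 20):     %d (%.1fx fewer)\n", sub, float64(full)/float64(sub))
	fmt.Printf("ideal coverage per origin: %.2f\n", ideal)
}
```

Doubling the origin count doubles the full-mesh total but leaves the subsetted total at 10,000, which is the point of fixing the subset size.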

But naive subsetting fails two ways, and the ring fixes both:

Balance — Van der Corput, not random

VanDerCorput(i) bit-reverses i and reads it as a fraction in [0,1). TestVanDerCorputKnownValues pins the sequence: 0, .5, .25, .75, .125, .625, .375, .875. Notice it never clumps — each new point lands in the largest remaining gap. We place gateways on the ring with this sequence (gateways have stable indices 0..G-1), so their subset "windows" evenly cover the origins. TestSubsetBalance asserts every origin is covered (no cold origins) and the max coverage stays within a sane band of the ideal — the demo shows min=5, p50=10, max=15 against an ideal of 10. Random placement would clump (hot and cold origins); Van der Corput spreads.
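A compact way to see the sequence is to lean on the standard library's bit reversal. This is a sketch, not the lab's implementation: vdc is a hypothetical name, and it assumes 32-bit indices.

```go
package main

import (
	"fmt"
	"math/bits"
)

// vdc returns the i-th binary Van der Corput value in [0,1): reverse the
// 32 bits of i and divide by 2^32.
func vdc(i uint32) float64 {
	return float64(bits.Reverse32(i)) / (1 << 32)
}

func main() {
	for i := uint32(0); i < 8; i++ {
		fmt.Printf("%d -> %.3f\n", i, vdc(i))
	}
}
```

The eight printed values are 0.000, 0.500, 0.250, 0.750, 0.125, 0.625, 0.375, 0.875: each new point splits one of the widest gaps left by the earlier ones.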

Stability — hash for origin identity, not slice index

The subtle, maintainer-level point: origins are positioned by a stable hash of their identity (originPos = FNV-1a / 2³²), not by their index in a slice. Why it matters: if you positioned origins by slice index and one origin left, every later origin's index — and thus ring position — would shift, reshuffling the whole fleet's subsets and causing a re-subsetting churn storm (the fix causing the problem). With stable hashed positions, removing one origin only perturbs the gateways whose window touched it.

TestSubsetStabilityVsHashMod proves it, and the demo quantifies it: when one origin leaves, the Van der Corput ring changes 11/500 gateways' subsets; hash-mod changes 259/500. Hash-mod uses (seed+j) mod len, so changing len moves nearly everything, which is exactly the instability that makes re-subsetting dangerous. Stability under membership change is a first-class requirement, because EDS/EndpointSlice churn (gw-08/gw-09) means membership changes constantly.
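The hash-mod failure is easy to reproduce in isolation. The sketch below is one possible baseline, not the lab's code: it assumes the gateway index is used directly as the seed and that the origin at slice index 260 leaves; hashModSubset, sameSet and changedGateways are names invented here.

```go
package main

import "fmt"

// hashModSubset returns the origins at indices (seed+j) mod len(origins)
// for j in [0, size). It assumes size <= len(origins).
func hashModSubset(origins []string, seed, size int) map[string]bool {
	out := make(map[string]bool, size)
	for j := 0; j < size; j++ {
		out[origins[(seed+j)%len(origins)]] = true
	}
	return out
}

// sameSet reports whether two subsets contain exactly the same origins.
func sameSet(a, b map[string]bool) bool {
	if len(a) != len(b) {
		return false
	}
	for o := range a {
		if !b[o] {
			return false
		}
	}
	return true
}

// changedGateways removes the origin at index `leave` and counts how many
// gateways end up with a different hash-mod subset. The gateway index is
// used as the seed (an assumption made for this sketch).
func changedGateways(numOrigins, numGateways, size, leave int) int {
	before := make([]string, numOrigins)
	for i := range before {
		before[i] = fmt.Sprintf("origin-%d", i)
	}
	after := make([]string, 0, numOrigins-1)
	after = append(after, before[:leave]...)
	after = append(after, before[leave+1:]...)

	changed := 0
	for g := 0; g < numGateways; g++ {
		if !sameSet(hashModSubset(before, g, size), hashModSubset(after, g, size)) {
			changed++
		}
	}
	return changed
}

func main() {
	fmt.Println(changedGateways(1000, 500, 20, 260), "/ 500 gateways changed")
}
```

Under these assumptions every gateway whose 20-slot window reaches index 260 or beyond sees its subset shift, which is gateways 241 through 499: 259 of 500. The removed origin was in only 20 of those windows; the rest changed purely because later indices slid down by one.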

This is the detail that separates "I read the blog" from "I could have written it": the low-discrepancy sequence gives balance, and stable per-origin positioning gives stability under churn. You need both, and they come from different mechanisms.
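The stability half can be sketched in a few lines. This assumes the 32-bit FNV-1a hash from the standard library and an origin identity string such as host:port; originPos here is an illustration of the idea, not a copy of the lab's function.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// originPos maps an origin's identity to a point in [0,1) using 32-bit
// FNV-1a. The position depends only on the identity string, never on where
// the origin sits in a membership slice.
func originPos(id string) float64 {
	h := fnv.New32a()
	h.Write([]byte(id)) // hash.Hash.Write never returns an error
	return float64(h.Sum32()) / (1 << 32)
}

func main() {
	before := []string{"10.0.0.1:443", "10.0.0.2:443", "10.0.0.3:443"}
	after := []string{"10.0.0.1:443", "10.0.0.3:443"} // the middle origin left

	// 10.0.0.3 moved from slice index 2 to index 1, but its ring position
	// is unchanged because it is derived from the identity alone.
	fmt.Println(originPos(before[2]) == originPos(after[1])) // prints true
}
```

An index-based placement would have moved that origin to a different ring position the moment its neighbour left.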

4. How it composes with the rest of the phase

The pool is the http.RoundTripper/Transport behind gw-03's endpoint filter — swap it in and the gateway stops churning origin connections.
The subset is computed from the EDS membership an xDS control plane pushes (gw-08), which itself derives from Kubernetes EndpointSlices (gw-09). Rapid pod churn is why stability matters; you'd debounce re-subsetting so the fix doesn't thrash.
Load balancing (gw-06: P2C, outlier ejection) operates within the subset, not over the whole fleet.
HTTP/2 origin multiplexing (gw-02) is the other churn lever: a handful of h2 connections per origin carry hundreds of concurrent requests — at the cost of the concentration/HOL trade-off.

5. Hands-on

cd src/go
bash ../scripts/verify.sh                 # tests + the demo above

# Play with the parameters and watch the trade-offs:
go run ./cmd/churnsim -origins 2000 -gateways 1000 -subset 10   # tiny subset: fewer conns, less resilience headroom
go run ./cmd/churnsim -origins 100  -gateways 50   -subset 50   # subset≈origins: approaches full mesh

Questions the demo answers experimentally:

How does shrinking the subset trade connection count against the number of origins each gateway can lose before it's under-provisioned?
How does coverage balance change with subset size and fleet ratio?
How does ring stability compare to hash-mod as you change origins?

6. Exercises

Wire the pool into gw-03: implement an http.RoundTripper backed by Pool (one per origin), drop it into ProxyEndpoint.Transport, and watch Created flatten under wrk keep-alive load vs Connection: close load.
Subset-aware pooling: make each per-loop pool only dial origins in this gateway's subset. Re-measure IdleTotal() — confirm the per-loop K× rise (§2) is bounded by the subset size.
Debounce re-subsetting: feed a rapidly-flapping membership list and add a debounce so the ring rebuilds at most every N ms. Show that without it, re-subsetting itself causes churn (the fix becoming the bug).
Size the subset for resilience: derive how large the subset must be so that losing f origins still leaves enough capacity — a quorum-style argument (connect it to db-17's majority reasoning).
Compare to consistent hashing: implement a consistent-hash ring and compare its stability and balance to Van der Corput placement. When would you pick each?

gw-04 — References

The primary source (read first — it's named in the JD)

Curbing Connection Churn in Zuul — Netflix TechBlog, 2023. The per-event-loop pool + subsetting work; the ≈8× / thousands→~60 result; the low-discrepancy (Van der Corput) distribution ring. https://netflixtechblog.com/curbing-connection-churn-in-zuul-2feb273a3598
Arthur Gonigberg, Curbing Connection Churn (companion write-up). https://arthur.gonigberg.com/2023/10/03/curbing-connection-churn/

Subsetting & balanced distribution

Google SRE Book / Site Reliability Engineering, "Load Balancing in the Datacenter" — deterministic subsetting: why N×M doesn't scale and how subset selection keeps balance. https://sre.google/sre-book/load-balancing-datacenter/
Van der Corput / low-discrepancy sequences — the math behind the balanced ring. (Any quasi-Monte-Carlo reference; the binary Van der Corput sequence is "bit-reverse i".)
Twitter/Finagle Aperture load balancer — another production take on subsetting with a ring (deterministic aperture). https://twitter.github.io/finagle/guide/Clients.html#aperture-load-balancers

Pooling & keep-alive

Go net/http.Transport — MaxIdleConnsPerHost, IdleConnTimeout, MaxConnsPerHost, ForceAttemptHTTP2; read the pooling logic.
Envoy connection pools (per-worker-thread pools — the same per-event-loop idea) and subset load balancer docs. https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/subsets
gRPC MAX_CONNECTION_AGE / keepalive — the rebalancing-vs-churn tension from gw-02.

TLS handshake cost (why churn is a CPU problem)

RFC 8446 (TLS 1.3) handshake; session resumption / 0-RTT.
Cloudflare blog — TLS handshake cost, session resumption, and why connection reuse is a CPU optimization.

Tooling

ss -s / ss -tan state established | wc -l — live connection counts.
nstat -az | grep -i active — TcpActiveOpens (outbound connections opened) — the churn counter on the gateway side.
wrk/wrk2 with and without --latency; toggle keep-alive to see churn, for example by adding -H "Connection: close" (a non-keepalive client maximizes it).

Cross-lab dependencies

Upstream: gw-01 (handshake/TIME_WAIT), gw-02 (h2 multiplexing), gw-03 (the endpoint Transport).
Downstream: gw-06 (LB over the subset), gw-08 (EDS membership), gw-09 (EndpointSlices, pod churn).

gw-04 — Analysis

The design-review treatment of the pool + subsetting you build in steps/, and the trade-offs to defend.

Required behaviors

Reuse by default. A second request to the same origin must not open a new connection while a warm one is idle in the pool.
Bounded pool. Each per-(origin, loop) pool has a maximum size; over the cap, either queue with a timeout or open a temporary connection. Which of the two applies is a policy decision and must be stated explicitly.
Balanced subset. Across the fleet, each origin appears in ~gateways × subset / origins subsets (even coverage).
Stable subset. Adding/removing one member changes only a small, balanced fraction of assignments — re-subsetting must not itself cause a churn storm.
Lifecycle hygiene. Idle eviction (TTL) and max-lifetime recycling, with keepalive so a dead origin is detected, not handed out.

Design decisions

Pool keyed by (origin, eventLoopID). This is the Zuul insight made concrete: the key includes the loop so a request on loop A never touches loop B's connections — lock-free, no handoff. The lab simulates K loops with K goroutine-bound pools.
Subset selection by Van der Corput ring. Map each member to a point on [0,1) via the bit-reversed sequence; each gateway takes the next subset_size origins clockwise from its own point. Deterministic, no coordination, balanced, and stable under membership change. The step measures coverage variance to prove balance.
h1 and h2 modes. The pool supports both: h1 pools ≈ concurrency connections per origin; h2 pools a few and multiplexes. The step shows the connection-count difference directly.
Churn measured, not asserted. The lab's whole point is the metric: connections.created counter, sampled per second, plotted with and without each technique. You reproduce the shape of the Netflix result (a big drop and a flat line), not an exact number.
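For the h1 and h2 modes decision above, the connection-count difference follows from simple arithmetic. The sketch assumes each HTTP/2 connection carries up to streamsPerConn concurrent requests; the value 100 used in main is an illustrative assumption, not a number from the lab, and both function names are invented here.

```go
package main

import "fmt"

// h1Conns: HTTP/1.1 carries one in-flight request per connection, so the
// pool needs about as many connections as there are concurrent requests.
func h1Conns(concurrency int) int { return concurrency }

// h2Conns: each HTTP/2 connection multiplexes up to streamsPerConn requests.
func h2Conns(concurrency, streamsPerConn int) int {
	if concurrency <= 0 || streamsPerConn <= 0 {
		return 0
	}
	return (concurrency + streamsPerConn - 1) / streamsPerConn // ceiling division
}

func main() {
	const concurrency = 300    // assumed peak in-flight requests to one origin
	const streamsPerConn = 100 // assumed per-connection stream limit
	fmt.Println("h1 connections:", h1Conns(concurrency))
	fmt.Println("h2 connections:", h2Conns(concurrency, streamsPerConn))
}
```

The smaller h2 count is what buys the churn reduction; the price is that more requests share the fate of each connection.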

Tradeoffs worth flagging

Per-loop pools raise minimum connection count. With K loops you warm up to K × the connections of a shared pool. Subsetting offsets this; the two techniques are designed to be used together, and the net is still far below the no-subset baseline.
Subset size is a resilience knob. A small subset minimizes connections but means losing a few origins removes a large fraction of a gateway's capacity to that service. Size the subset so that losing f origins still leaves enough; this is a quorum-flavored argument (echoes db-17's majority reasoning).
Idle eviction vs churn. Aggressive eviction frees memory but re-handshakes on the next request — re-creating the very churn you removed. Tune the idle TTL to the traffic's inter-request gap; validate with the churn metric.
Stability vs balance under change. Perfectly balanced static assignment and minimal movement under change are in slight tension; the low-discrepancy ring is the sweet spot. Hash-mod is maximally unstable (everything moves when membership changes); fully random is unbalanced. Be able to rank the three.
The cold-pool thundering herd. A fleet deploy empties every pool; all gateways re-establish at once → a churn spike precisely during a deploy. Mitigations are operational (gw-12): staggered rollout, pre-warming, reconnect jitter, traffic ramp.

What production adds beyond this lab

Pool acquisition backpressure integrated with admission control (gw-06): when the pool is saturated, shed load rather than queue unboundedly.
Outlier ejection inside the subset (gw-06): a bad origin in your subset is removed from rotation without re-subsetting the whole ring.
Control-plane-driven membership (gw-08 EDS / gw-09 EndpointSlices) with debounced re-subsetting so rapid pod churn doesn't thrash the ring.
Per-origin protocol negotiation (ALPN) so the pool automatically uses h2 where available and h1 otherwise.
Rich churn observability: connections.created.rate, pool-utilization, acquisition-wait histogram, eviction counters (gw-11).

gw-04 — Execution

Prerequisites

Go ≥ 1.25 (stdlib only, offline).

One-shot

cd gw-04-connection-management && bash scripts/verify.sh   # tests + churnsim demo

Per-language workflow (Go)

cd gw-04-connection-management/src/go
go test -race -count=1 ./...        # pool, per-loop, subsetting tests
go run ./cmd/churnsim -origins 1000 -gateways 500 -subset 20

churnsim flags

flag	default	meaning
`-origins`	1000	number of origin instances (M)
`-gateways`	500	number of gateway instances (N)
`-subset`	20	subset size per gateway

Output shows: the N×M → N×subset connection reduction, coverage balance (min/p50/max vs ideal), and stability (ring vs hash-mod) when one origin leaves.

Package map

File	What
`connpool/pool.go`	keep-alive pool: Get/Put, Created/Reused/Closed, idle TTL
`connpool/looppools.go`	per-event-loop pools (lock-free, no handoff)
`connpool/ring.go`	Van der Corput sequence, the distribution ring, subsetting, coverage, hash-mod baseline
`cmd/churnsim`	the no-network demonstration of all three results

See GUIDE.md for the full deep dive.

gw-04 — Verification

One command

cd gw-04-connection-management && bash scripts/verify.sh

What the tests prove

Test	Invariant
`TestPoolReuse`	a put-then-get reuses the warm connection (`Created==1`, `Reused==1`)
`TestNoPoolIsAllChurn`	`maxIdle=0` → every request dials (`Created==100`) — pure churn baseline
`TestPoolingReducesChurn`	`maxIdle=8` → 100 cycles dial once, reuse 99 times
`TestIdleEvictionCausesChurn`	a too-short idle TTL evicts and re-dials — over-tuning re-creates churn
`TestPerLoopIndependence`	per-loop pools each dial their own conn (the `K×` trade-off), then reuse within a loop
`TestVanDerCorputKnownValues`	sequence is exactly `0, .5, .25, .75, .125, .625, .375, .875`
`TestSubsetConnectionMath`	subsetting cuts `N×M` to `N×subset` (≥10× here)
`TestSubsetBalance`	every origin is covered (no cold origins); max coverage within a sane band of ideal
`TestSubsetStabilityVsHashMod`	the ring is strictly more stable than hash-mod and does not reshuffle everything when one origin leaves

All under -race.

Demo assertions (churnsim, in verify.sh)

The demo prints, for 1000 origins / 500 gateways / subset 20:

500000 → 10000 connections (50× fewer),
coverage min/p50/max ≈ 5/10/15 against ideal 10,
on one origin leaving: ring ~11/500 vs hash-mod ~259/500 gateways changed.

What "green" does NOT guarantee

No real sockets. The pool is abstracted over a Resource so churn is measured deterministically; wiring it to net.Conn/http.Transport is an exercise (GUIDE §6.1).
No live LB within the subset. P2C/outlier ejection over the subset is gw-06.
No debounced re-subsetting. Membership-churn debouncing is an exercise (GUIDE §6.3); production needs it so the fix doesn't cause churn.

gw-04 step 01 — A connection pool, and measuring churn

Goal

Build a minimal origin connection pool with keep-alive, then measure the churn drop versus a no-pool baseline. The metric is the lesson: you want to see connections.created.rate collapse.

Code — `src/go/pool.go`

package pool

import (
	"net"
	"sync"
	"sync/atomic"
	"time"
)

// Dialer opens a new connection to an origin (TCP+TLS happens here —
// the expensive part we want to avoid repeating).
type Dialer func(origin string) (net.Conn, error)

type Pool struct {
	dial    Dialer
	maxIdle int
	idleTTL time.Duration

	mu   sync.Mutex
	idle map[string][]*pooledConn // origin -> idle connections

	Created atomic.Int64 // THE churn metric: total connections opened
	Reused  atomic.Int64
}

type pooledConn struct {
	net.Conn
	origin    string
	idleSince time.Time
}

func New(dial Dialer, maxIdle int, idleTTL time.Duration) *Pool {
	return &Pool{dial: dial, maxIdle: maxIdle, idleTTL: idleTTL,
		idle: map[string][]*pooledConn{}}
}

// Get returns a warm connection if one is available, else dials a new
// one (and counts it as churn).
func (p *Pool) Get(origin string) (*pooledConn, error) {
	p.mu.Lock()
	q := p.idle[origin]
	for len(q) > 0 {
		c := q[len(q)-1]
		q = q[:len(q)-1]
		if time.Since(c.idleSince) > p.idleTTL { // expired: close, keep looking
			c.Conn.Close()
			continue
		}
		p.idle[origin] = q
		p.mu.Unlock()
		p.Reused.Add(1)
		return c, nil // REUSE: no handshake
	}
	p.idle[origin] = q
	p.mu.Unlock()

	conn, err := p.dial(origin) // CHURN: a new TCP+TLS handshake
	if err != nil {
		return nil, err
	}
	p.Created.Add(1)
	return &pooledConn{Conn: conn, origin: origin}, nil
}

// Put returns a connection to the pool for reuse (or closes it if full).
func (p *Pool) Put(c *pooledConn) {
	c.idleSince = time.Now()
	p.mu.Lock()
	defer p.mu.Unlock()
	q := p.idle[c.origin]
	if len(q) >= p.maxIdle {
		c.Conn.Close() // pool full: this becomes churn next time
		return
	}
	p.idle[c.origin] = append(q, c)
}

In real Go you'd often just configure http.Transport (MaxIdleConnsPerHost, IdleConnTimeout) — and the step shows that too. The hand-rolled pool exists so the Created counter is explicit and you can watch churn.

Measure it

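// Churn sampler: print connections-created-per-second.
// (Needs "fmt" in addition to the imports above.)
func sample(p *Pool, stop <-chan struct{}) {
	var last int64
	t := time.NewTicker(time.Second)
	defer t.Stop() // release the ticker when sampling stops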
	for {
		select {
		case <-stop:
			return
		case <-t.C:
			now := p.Created.Load()
			rate := now - last
			last = now
			reuse := p.Reused.Load()
			fmt.Printf("connections.created/s=%d  total_created=%d  reused=%d\n",
				rate, now, reuse)
		}
	}
}

Run two experiments against a local origin under fixed load (wrk2 -R20000):

No pool (Put always closes / maxIdle=0): created/s tracks request rate — thousands per second.
With pool (maxIdle=64, sane idleTTL): created/s falls to a trickle after warmup; reused climbs to ≈ request count.

Tasks

Implement Pool; wire it as the Transport behind gw-03's endpoint filter (or a standalone loop that Get/Puts per request).
Plot connections.created/s for maxIdle=0 vs maxIdle=64 under identical load. Capture the drop (this is the gw-04 result in miniature).
Tune idleTTL too low (e.g. 100ms) and show churn comes back — proving over-eager eviction fights pooling.

Acceptance

created/s collapses from ≈request-rate (no pool) to near-zero (pooled), and reused ≈ total requests.
You can produce a churn resurgence by setting idleTTL too low and explain why.

Discussion prompts

Why is churn a CPU problem and not only a latency problem? (TLS asymmetric crypto per handshake.)
What's the right maxIdle? (≈ peak concurrent requests to that origin on this instance — tie to Little's law, gw-06.)
This pool has one global lock. On a multi-loop gateway at 1M+ rps, why is that lock a problem, and what does step 02 do about it?

gw-04 step 02 — Per-event-loop pools

Goal

Eliminate the single global pool lock and the cross-thread handoff by giving each event loop its own pool. A request handled on loop i only ever uses connections owned by loop i: no cross-loop lock contention, cache-local, the entire request/response on one thread. This is the specific change the Netflix talk highlights.

Code — `src/go/loop_pools.go`

package pool

import (
	"net"
	"time"
)

// LoopPools is a set of K independent pools, one per event loop. There
// is NO shared lock across loops: pool[i] is only ever touched by the
// goroutine(s) bound to loop i.
type LoopPools struct {
	pools []*Pool
}

func NewLoopPools(k int, dial Dialer, maxIdle int, idleTTL time.Duration) *LoopPools {
	lp := &LoopPools{pools: make([]*Pool, k)}
	for i := range lp.pools {
		lp.pools[i] = New(dial, maxIdle, idleTTL)
	}
	return lp
}

// For returns the pool owned by event loop `loopID`. The caller is the
// loop goroutine, so within one loop the *Pool's* lock is effectively
// uncontended (only this loop uses it).
func (lp *LoopPools) For(loopID int) *Pool { return lp.pools[loopID] }

// Totals aggregates churn across loops for the metric.
func (lp *LoopPools) Totals() (created, reused int64) {
	for _, p := range lp.pools {
		created += p.Created.Load()
		reused += p.Reused.Load()
	}
	return
}

Wiring — one worker goroutine per loop, pinned to its pool

This models Netty's EventLoopGroup: K workers, each draining its own request queue, each using only its own pool.

func runWorker(loopID int, lp *LoopPools, reqs <-chan Request) {
	p := lp.For(loopID) // this worker's pool — no other worker touches it
	for r := range reqs {
		c, err := p.Get(r.Origin)
		if err != nil {
			r.Done(err)
			continue
		}
		// ... write request / read response on c (same thread) ...
		p.Put(c) // returned to THIS loop's pool
		r.Done(nil)
	}
}

func main() {
	k := runtime.GOMAXPROCS(0)
	lp := NewLoopPools(k, dialTLS, 32, 30*time.Second)
	queues := make([]chan Request, k)
	for i := 0; i < k; i++ {
		queues[i] = make(chan Request, 1024)
		go runWorker(i, lp, queues[i])
	}
	// Accept connections and assign each to a loop by a stable hash so a
	// connection's requests always land on the same loop (affinity).
	acceptLoop(func(conn net.Conn) {
		loopID := stableHash(conn.RemoteAddr()) % k
		// ... read requests from conn, push to queues[loopID] ...
	})
}
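The stableHash helper used in main is not defined in the listing. One possible sketch, assuming the client's remote address string is the affinity key and using 32-bit FNV-1a from the standard library (loopFor is a hypothetical name):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// loopFor maps a client address to one of k event loops. The same address
// always maps to the same loop, which gives connection affinity.
// It assumes k > 0.
func loopFor(remoteAddr string, k int) int {
	h := fnv.New32a()
	h.Write([]byte(remoteAddr)) // hash.Hash.Write never returns an error
	return int(h.Sum32() % uint32(k))
}

func main() {
	const k = 8
	for _, addr := range []string{"203.0.113.7:51514", "203.0.113.7:51515", "198.51.100.2:40000"} {
		fmt.Printf("%s -> loop %d\n", addr, loopFor(addr, k))
	}
}
```

Hashing the full address (including the port) spreads many connections from one client IP across loops; hashing only the IP would pin them all to one loop.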

The trade-off to measure

Per-loop pools mean each loop warms its own connections, so the minimum connection count rises ~K× versus a shared pool. Measure it:

shared pool:    min connections ≈ peak_concurrency_per_origin
per-loop pools: min connections ≈ K × peak_concurrency_per_loop_per_origin

This is exactly why subsetting (step 03) is needed to keep the total in check. The win is that the contention and handoff costs vanish, which matters far more at 1M+ rps than a modest rise in idle connections — and subsetting claws the count back.

Tasks

Convert step 01's single pool into LoopPools with K = GOMAXPROCS workers, each pinned to its own pool.
Under fixed load, compare lock contention: profile the shared-pool version (go test -bench with -mutexprofile, or pprof) vs the per-loop version. Show the mutex contention on the shared pool and its absence per-loop.
Record the connection-count trade-off: per-loop pools hold more idle connections. Note the number — step 03 will reduce it with subsetting.

Acceptance

Per-loop version shows ~zero cross-loop mutex contention in the profile; shared version shows contention rising with K.
You can state the connection-count trade-off with real numbers and explain why subsetting is the companion fix.

Discussion prompts

Why is connection affinity (a connection's requests always hitting the same loop) important here? What breaks if a request can hop loops mid-flight?
In Java/Netty this is automatic because a Channel is bound to one EventLoop for its lifetime. How does that compare to the goroutine-per-connection Go model, and where does each pay a cost?
The per-loop trade-off raises idle connections by ~K×. Argue why that's still the right call at Netflix scale before subsetting.

gw-04 step 03 — Subsetting with a Van der Corput ring

Goal

Stop every gateway from connecting to every origin. Each gateway picks a balanced subset of origins using a low-discrepancy (Van der Corput) distribution ring, so total connections = gateways × subset, load stays even across origins, and subset assignments barely move when membership changes. This is the core of the Netflix churn win.

Code — the Van der Corput sequence

package subset

// vanDerCorput returns the i-th value of the binary Van der Corput
// low-discrepancy sequence in [0,1): write i in binary, reverse the
// bits, read back as a fraction. Successive points always land in the
// largest remaining gap — no clumping.
func vanDerCorput(i uint32) float64 {
	var rev uint32
	bits := 32
	x := i
	for b := 0; b < bits; b++ {
		rev = (rev << 1) | (x & 1)
		x >>= 1
	}
	return float64(rev) / float64(1<<32) // rev / 2^32  in [0,1)
}

Code — map members to the ring and pick a subset

package subset

import "sort"

type Ring struct {
	points []ringPoint // origins placed on [0,1), sorted
}
type ringPoint struct {
	pos    float64
	origin string
}

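// BuildRing places each origin deterministically on the ring using the
// Van der Corput sequence indexed by the origin's position in `origins`.
// That index is stable only while the caller keeps the slice order fixed;
// to survive membership changes, position origins by a stable hash of
// their identity instead (see the GUIDE's stability section).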
func BuildRing(origins []string) *Ring {
	pts := make([]ringPoint, len(origins))
	for i, o := range origins {
		pts[i] = ringPoint{pos: vanDerCorput(uint32(i)), origin: o}
	}
	sort.Slice(pts, func(a, b int) bool { return pts[a].pos < pts[b].pos })
	return &Ring{points: pts}
}

// Subset returns `size` origins for the gateway placed at gwPos: walk
// clockwise from gwPos and take the next `size` distinct origins. Two
// gateways at nearby positions overlap a lot; gateways spread evenly
// (because gwPos also comes from the Van der Corput sequence) cover the
// origins evenly.
func (r *Ring) Subset(gwPos float64, size int) []string {
	if size >= len(r.points) {
		out := make([]string, len(r.points))
		for i, p := range r.points {
			out[i] = p.origin
		}
		return out
	}
	// find first point >= gwPos
	start := sort.Search(len(r.points), func(i int) bool {
		return r.points[i].pos >= gwPos
	})
	out := make([]string, 0, size)
	for i := 0; i < size; i++ {
		out = append(out, r.points[(start+i)%len(r.points)].origin)
	}
	return out
}

// GatewayPosition places gateway g on the same ring via Van der Corput,
// so the whole fleet is evenly distributed without coordination.
func GatewayPosition(gatewayIndex int) float64 {
	return vanDerCorput(uint32(gatewayIndex))
}

Prove balance and stability

// coverage[origin] = number of gateways whose subset includes it.
// Even coverage => even load on origins.
func coverage(origins []string, gateways, subsetSize int) map[string]int {
	ring := BuildRing(origins)
	cov := map[string]int{}
	for g := 0; g < gateways; g++ {
		for _, o := range ring.Subset(GatewayPosition(g), subsetSize) {
			cov[o]++
		}
	}
	return cov
}

Experiments:

Connection math. origins=1000, gateways=500, subset=20. Total connections = 500 × 20 = 10,000 vs 500 × 1000 = 500,000 everyone-to-everyone. Print both.
Balance. Compute coverage and report min/max/stddev. Compare Van der Corput placement vs naive rand.Float64() placement — the random version has higher variance (hot/cold origins).
Stability under churn. Remove one origin, rebuild the ring, and count how many (gateway → origin) assignments changed. Compare to hash-mod subsetting (origins[(hash(gw)+i) % len]), which reshuffles almost everything when len changes. The ring moves only a small, balanced fraction.
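For the balance experiment, a small summary helper turns the coverage map into the min/max/stddev the step asks for. This is a sketch with invented names (coverageStats, summarize). It iterates over the origin list rather than the map so that origins nobody selected count as zero instead of being silently skipped.

```go
package main

import (
	"fmt"
	"math"
)

type coverageStats struct {
	Min, Max     int
	Mean, StdDev float64
}

// summarize computes coverage statistics over every origin, treating an
// origin that is absent from cov as covered by zero gateways.
func summarize(origins []string, cov map[string]int) coverageStats {
	if len(origins) == 0 {
		return coverageStats{}
	}
	s := coverageStats{Min: math.MaxInt}
	n := float64(len(origins))
	var sum float64
	for _, o := range origins {
		c := cov[o]
		if c < s.Min {
			s.Min = c
		}
		if c > s.Max {
			s.Max = c
		}
		sum += float64(c)
	}
	s.Mean = sum / n
	var sq float64
	for _, o := range origins {
		d := float64(cov[o]) - s.Mean
		sq += d * d
	}
	s.StdDev = math.Sqrt(sq / n) // population standard deviation
	return s
}

func main() {
	origins := []string{"a", "b", "c"}
	cov := map[string]int{"a": 2, "b": 4} // "c" is a cold origin
	fmt.Printf("%+v\n", summarize(origins, cov))
}
```

A minimum of zero is the cold-origin signal; a large standard deviation relative to the mean is the hot/cold imbalance that random placement tends to produce.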

Tasks

Implement vanDerCorput, BuildRing, Subset, and the coverage analysis.
Print the connection-count reduction for a realistic fleet size.
Show, with numbers: (a) lower coverage variance than random; (b) far fewer assignment changes than hash-mod when one member leaves.
Combine with step 02: each per-loop pool only dials origins in this gateway's subset. Re-measure total idle connections — the K× rise from step 02 is now bounded by the subset size.

Acceptance

Correct Van der Corput values (0, .5, .25, .75, .125, ...).
A printed N×M → N×subset reduction (e.g. 500k → 10k).
Quantified balance (low coverage variance) and stability (few reassignments on membership change) versus random and hash-mod.

Discussion prompts

Why does bit-reversal produce an evenly-spread sequence, intuitively? (Each new point bisects the largest existing gap.)
How big must the subset be so that losing f origins still leaves enough capacity? (A quorum-style argument — connect it to db-17's majority reasoning.)
Membership comes from the control plane (gw-08 EDS / gw-09 EndpointSlices). Rapid pod churn would rebuild the ring constantly. How do you debounce re-subsetting so the fix doesn't cause churn?
Where does this interact with load balancing (gw-06)? (You still P2C within the subset and eject outliers within it.)

gw-05 — WebSockets & Persistent-Connection Proxies (Pushy)

Two of the talks in the JD — "Pushy to the Limit" and "Scaling Push Messaging for Millions of Devices" — are about one system: Pushy, Netflix's WebSocket proxy. It holds hundreds of millions of concurrent, persistent WebSocket connections from devices, and lets backend services push a message to any specific device in real time. It sustains a steady 99.999% message-delivery reliability, and the team pushed per-node density from ~60k to ~200k connections (headroom to ~400k) by switching instance types and reworking the architecture.

This lab is where the gateway model inverts. Everywhere else, the client makes a request and you proxy it (request/response, short-lived). Here the client opens one long-lived connection and keeps it open for hours or days, and the server initiates messages. That inversion changes everything: connection density becomes the dominant cost, routing becomes "find the node holding this device's connection," and draining a node for a deploy becomes the hardest operational problem in the phase.

You will build a WebSocket server, a push registry (device → node), and an async delivery path (a backend publishes; the right node pushes to the right device) — Pushy in miniature.

1. What is it?

A persistent-connection proxy maintains a standing connection to each client so messages can flow server → client at any time, not just in response to a client request. WebSocket (RFC 6455) is the canonical transport: it starts as an HTTP request with an Upgrade: websocket header, then the same TCP connection switches to a bidirectional, framed, full-duplex message channel.

client ── HTTP GET /ws  Upgrade: websocket ──▶ Pushy node
client ◀── 101 Switching Protocols ───────────  (handshake done)
        ══════ full-duplex WebSocket frames ══════   (stays open for hours)
        ◀── server can push a message at ANY time ──

Pushy's job is twofold:

Hold the connections. Terminate hundreds of millions of WebSockets across a fleet, keep them alive (ping/pong), authenticate them, and survive deploys without dropping them carelessly.
Route messages to them. When a backend wants to reach device D, look up which Pushy node holds D's connection in the push registry, deliver the message there, and that node writes it down the socket.

The supporting cast (from the talks):

Push registry — a low-latency key-value store mapping deviceId → {node, connection metadata}. Netflix moved it from Dynomite (a Redis wrapper) to KeyValue, an internal KV service.
Message processor — a standalone Spring Boot service consuming from Kafka; backends publish "send X to device D" events, the processor looks up the registry and forwards to the owning node.
Async, decoupled delivery — the push path is event-driven through Kafka so a slow or redeploying Pushy node doesn't block publishers.

2. Why does it matter?

It's a flagship system of the team. Two of seven linked talks are about it. Being able to whiteboard Pushy — connection density, push registry, async delivery, graceful drain — is close to a guaranteed win in the loop.
Persistent connections break every assumption from the rest of the phase. No connection pooling to clients (the connection is the session). Load balancing is "connections," not "requests," and it's sticky for the connection's whole life, so a node can stay hot long after the LB stopped sending it new connections (the gw-01 SO_REUSEPORT caveat, at its most extreme). State per connection is the dominant memory cost.
Graceful drain is genuinely hard here. You cannot drop 200,000 live connections to deploy. You must coordinate reconnect: signal clients to reconnect (with backoff + jitter so they don't thunder onto the next node), drain over minutes not seconds, and keep the push registry consistent as devices migrate between nodes.
It's a real-time, at-least-once delivery system. 99.999% reliability over an unreliable mobile network means reconnect logic, message acknowledgment, idempotency, and dealing with a device that's briefly offline — distributed-systems delivery semantics applied to the edge.

3. How does it work?

The WebSocket handshake and framing

The upgrade is an HTTP/1.1 request with specific headers; the server proves it understood by hashing the client key with a magic GUID:

GET /ws HTTP/1.1
Host: push.netflix.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13

→ 101 Switching Protocols
  Sec-WebSocket-Accept: base64(sha1(key + "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"))

After 101, the connection speaks WebSocket frames: a small header (FIN bit, opcode — text/binary/close/ping/pong, mask bit, payload length 7/16/64-bit), then payload. Client→server frames are masked (XOR with a 4-byte key) to defeat cache-poisoning of intermediaries; server→client frames are not. Ping/pong frames keep the connection alive through NATs and detect dead peers (gw-01 keepalive at the app layer).
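Both mechanisms fit in a few lines of Go. The sketch computes Sec-WebSocket-Accept for the sample key shown above and unmasks a client frame payload; the helper names are invented here, while the GUID and the masked "Hello" bytes come from RFC 6455.

```go
package main

import (
	"crypto/sha1"
	"encoding/base64"
	"fmt"
)

// wsGUID is the fixed GUID defined by RFC 6455.
const wsGUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

// acceptKey computes base64(sha1(clientKey + GUID)).
func acceptKey(clientKey string) string {
	sum := sha1.Sum([]byte(clientKey + wsGUID))
	return base64.StdEncoding.EncodeToString(sum[:])
}

// unmask XORs the payload in place with the 4-byte masking key. Masking and
// unmasking are the same operation.
func unmask(key [4]byte, payload []byte) {
	for i := range payload {
		payload[i] ^= key[i%4]
	}
}

func main() {
	fmt.Println(acceptKey("dGhlIHNhbXBsZSBub25jZQ=="))

	// Masked "Hello" text-frame payload from the RFC 6455 examples.
	payload := []byte{0x7f, 0x9f, 0x4d, 0x51, 0x58}
	unmask([4]byte{0x37, 0xfa, 0x21, 0x3d}, payload)
	fmt.Println(string(payload))
}
```

For the sample key the accept value is s3pPLMBiTxaQ9kYGzzhZRbK+xOo=, and the unmasked payload reads Hello.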

The push architecture

                         ┌──────────────── Pushy fleet ────────────────┐
 device D ═══WS════════▶ │ node N2  (holds D's live connection)        │
                         │   on connect: registry[D] = N2              │
                         └───────────────▲─────────────────────────────┘
                                         │ deliver(D, msg) → write down socket
   backend ──"send msg to D"──▶ Kafka ──▶ Message Processor ──registry.lookup(D)=N2
                                         │
                              Push Registry (KeyValue): deviceId → node

Device connects to some node N2; N2 authenticates it and writes registry[D] = N2 (+ connection metadata, with a TTL).
A backend publishes "send msg to D" to Kafka (fire-and-forget, fast).
The Message Processor consumes it, looks up registry[D] = N2, forwards the message to N2.
N2 writes the message down D's socket. If D isn't connected anywhere (registry miss / stale), the message is dropped or queued per policy.

Connection density — the dominant constraint

At 200k connections/node, the per-connection cost is everything:

Memory: read/write buffers, TLS state, app session state; keep it tiny. 200k × 10 KB ≈ 2 GB just for connection state.
File descriptors: 200k fds/node → tune ulimit/fs.file-max.
Event-loop threads: Netty event loops multiplexing all of them; never block (gw-03), or you stall thousands of connections.
Heartbeat cost: ping/pong to 200k connections is real traffic and CPU; tune the interval against NAT timeout and detection latency.

The talk's density jump (60k→200k→400k) came from a better instance type and trimming per-connection overhead — a profiling-and-budgeting exercise.
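The budgeting arithmetic can be made explicit with a small calculator. This is a sketch only: the `connBudget` type and the per-field byte counts are illustrative assumptions that happen to sum to the 10 KB figure used above, not measurements from Pushy.

```go
package main

import "fmt"

// connBudget is a hypothetical per-connection memory budget, in bytes.
type connBudget struct {
	ReadBuf, WriteBuf, TLSState, Session int64
}

// perConn is the total connection-state cost of one connection.
func (b connBudget) perConn() int64 {
	return b.ReadBuf + b.WriteBuf + b.TLSState + b.Session
}

// nodeBytes is the connection-state footprint of a node holding conns connections.
func nodeBytes(b connBudget, conns int64) int64 {
	return b.perConn() * conns
}

func main() {
	// Illustrative split of a 10 KB budget; real numbers come from profiling.
	b := connBudget{ReadBuf: 4_000, WriteBuf: 4_000, TLSState: 1_500, Session: 500}
	for _, conns := range []int64{60_000, 200_000, 400_000} {
		fmt.Printf("%d conns -> %.1f GB of connection state\n",
			conns, float64(nodeBytes(b, conns))/1e9)
	}
}
```

Halving any one field shows immediately how much density headroom it buys, which is the point of keeping the budget per connection rather than per node.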

Graceful drain (the hard part)

deploy node N2:
  1. stop accepting NEW connections (LB readiness off — gw-09)
  2. tell connected devices to reconnect (a "go away" app message)
  3. devices reconnect to OTHER nodes with backoff + JITTER
     (no jitter ⇒ a reconnect thundering herd onto the next node)
  4. as each device migrates, registry[D] updates to its new node
  5. when connections drain below a threshold (or a deadline), exit

Draining 200k connections takes minutes; the deploy system must tolerate that (long terminationGracePeriodSeconds, gw-09). Botching the jitter turns a deploy into a self-inflicted DDoS on the rest of the fleet — a classic war story to have ready.

4. Core terminology

Term	Definition
WebSocket (RFC 6455)	HTTP-upgraded, full-duplex, framed message channel over one TCP connection.
Upgrade handshake	The `Upgrade: websocket` request → `101 Switching Protocols` response.
Frame	The WebSocket transmission unit: a header (FIN bit, opcode such as text/binary/close/ping/pong, mask bit, length) followed by the payload, which is masked only client→server.
Masking	Client→server payloads XOR'd with a per-frame key to defeat intermediary cache poisoning.
Ping/pong	Keepalive/liveness frames; detect dead peers and hold NAT mappings open.
Persistent connection	A long-lived (hours/days) connection where the server can push at any time.
Push registry	`deviceId → owning node` map (Netflix: KeyValue, formerly Dynomite).
Message processor	Kafka consumer that routes "send to device D" events to the owning node.
Connection density	Concurrent connections per node — the dominant cost (Netflix: ~200k, up to ~400k).
Reconnect storm	Many clients reconnecting simultaneously; tamed with backoff + jitter.
At-least-once delivery	Delivery semantics requiring acks + idempotency for reliability.
SSE	Server-Sent Events: a simpler, server→client-only alternative to WebSockets.

5. Mental models

A request/response gateway is a restaurant; Pushy is a switchboard. Diners come, order, eat, leave (short-lived). A switchboard keeps a line open to every subscriber for hours and rings them when there's a call. The cost model flips from "throughput of orders" to "how many lines can one operator hold open."
The push registry is a hotel front desk. To deliver a message to a guest you don't search every room — you ask the desk which room (node) they're in. Keep the desk's records fresh (TTL, update on check-in/out) or you'll knock on empty rooms (stale registry → lost messages).
Reconnects without jitter are a stampede through one door. When a node drains, every device bolts for the exit at once; if they all run to the same next node simultaneously you've moved the fire, not put it out. Jitter spreads the stampede over time and across nodes.
Decoupling via Kafka is a mailbox, not a phone call. Publishers drop a letter and move on; the message processor delivers when it can. A redeploying node doesn't block the publisher — the letter waits. The cost is the delivery delay and the need for the registry to be right when the letter is processed.

6. Common misconceptions

"WebSockets are just a faster HTTP." They invert the model: server-initiated, long-lived, stateful. The hard problems (density, drain, routing-to-a-connection) don't exist in request/response and are the whole job here.
"Load balancing persistent connections is the same as requests." It's connection-sticky for the connection's entire life. New connections balance; existing ones don't move. A node can be hot for hours after the LB stops sending it new connections; you rebalance by forcing reconnects, which is exactly the drain problem.
"Just scale out to add capacity." New nodes only get new connections; existing connections stay put. To actually shift load you must cycle connections (drain/reconnect); adding nodes alone doesn't cool a hot one.
"The push registry can be eventually consistent and it's fine." Staleness directly causes lost or misdelivered messages (you push to a node that no longer holds the device). It needs tight TTLs, updates on connect/disconnect, and a miss policy. Consistency here is a delivery-reliability property.
"Heartbeats are free." Ping/pong to hundreds of millions of connections is significant traffic and CPU, and the interval trades liveness-detection latency against cost and NAT-timeout safety. It's a real tuning decision.

7. Interview talking points

"Design a push-notification system for 300M devices." The Pushy question. Spine: WebSocket fleet holding the connections → push registry (device→node) → Kafka + message processor for async fan-out → reconnect strategy with backoff+jitter → graceful drain. Quantify: conns/node (~200k), memory/conn, registry QPS, fan-out rate.
"How do you deploy a node holding 200k live connections?" Stop new connections (readiness off) → signal clients to reconnect → backoff + jitter so they don't stampede onto one node → drain over minutes → update the registry as devices migrate → exit on threshold/deadline. Name the reconnect-storm failure mode and the jitter fix.
"Where does the device→node mapping live and why does consistency matter?" A low-latency KV store (Netflix: KeyValue) with TTLs, updated on connect/disconnect. Staleness = lost/misdelivered messages, so it's a delivery-reliability property, not just a cache.
"How do you get 99.999% delivery over flaky mobile networks?" At-least-once with acks + idempotency; client reconnect with resume; a brief offline queue or drop policy; the async Kafka path so publishers aren't coupled to delivery. Be explicit about the exactly-once myth (you get at-least-once + idempotency).
"How did per-node density go from 60k to 200k?" Better instance type + trimming per-connection memory/CPU (buffers, TLS state, heartbeat cost) found by profiling. fds, event-loop threads, and GC pressure are the limits to push on. Shows you think in per-connection budgets.
"WebSocket vs SSE vs long-poll vs HTTP/2 streams vs WebTransport?" SSE = server→client only, simpler, auto-reconnect, no masking. Long-poll = fallback, high overhead. WebSocket = full-duplex, the workhorse. h2/h3 streams and WebTransport (over QUIC) are the modern alternatives; know the trade-offs (h3/WebTransport survives network changes via connection migration, gw-02).

8. Connections to other labs

gw-01 (L4) — draining at its hardest; keepalive (ping/pong is the app-layer version); SO_REUSEPORT stickiness taken to the extreme.
gw-02 (L7) — the WebSocket upgrade is the other HTTP upgrade path; h3/WebTransport is the future transport for this.
gw-03 (API gateway) — Pushy is a specialized gateway; the event-loop "never block" discipline is life-or-death at 200k conns/node.
gw-06 (resilience) — reconnect backoff+jitter, load shedding on the connect path, and not letting a reconnect storm cascade.
gw-08 / gw-09 — the registry is a distributed datastore; drain is gated by Kubernetes readiness + a long termination grace period.
db-20 (distributed KV) — the push registry is exactly a low-latency distributed key-value store with TTLs; you built one.

gw-05 — The Hitchhiker's Guide to Pushy (WebSockets at Scale)

Companion to CONCEPTS.md, with the runnable miniature Pushy in src/go/pushy/. Two of the JD's seven talks are about this system; being able to whiteboard and build it is close to a guaranteed interview win.

Everywhere else in this phase, a client makes a request and you proxy it (short-lived, request/response). Pushy inverts that: a device opens one long-lived WebSocket and keeps it open for hours, and the server pushes messages to it at any time. That inversion changes everything — connection density becomes the dominant cost, routing becomes "find the node holding this device," and draining a node becomes the hardest operational problem in the phase. This lab builds all of it.

1. The WebSocket protocol (ws.go)

A WebSocket starts as an HTTP/1.1 request with Upgrade: websocket; the server proves it understood by hashing the client's key with a magic GUID. AcceptKey does exactly that, and TestAcceptKey pins it to the RFC 6455 §1.3 worked example (dGhlIHNhbXBsZSBub25jZQ== → s3pPLMBiTxaQ9kYGzzhZRbK+xOo=). After the 101 Switching Protocols response (ServerHandshake), the connection speaks frames.
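The accept-key derivation is small enough to show whole. This is a stand-in sketch using only the standard library; the lowercase `acceptKey` is a hypothetical name, and the lab's own AcceptKey may have a different signature.

```go
package main

import (
	"crypto/sha1"
	"encoding/base64"
	"fmt"
)

// wsGUID is the fixed "magic" GUID from RFC 6455.
const wsGUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

// acceptKey derives the Sec-WebSocket-Accept value: the base64 encoding of
// SHA-1(clientKey + GUID), where clientKey is the Sec-WebSocket-Key header
// value exactly as the client sent it (still base64 text, not decoded).
func acceptKey(clientKey string) string {
	sum := sha1.Sum([]byte(clientKey + wsGUID))
	return base64.StdEncoding.EncodeToString(sum[:])
}

func main() {
	// The RFC 6455 §1.3 worked example.
	fmt.Println(acceptKey("dGhlIHNhbXBsZSBub25jZQ=="))
}
```

SHA-1 is used here because the RFC mandates it for the handshake; it proves the server speaks WebSocket and provides no security on its own.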

WriteFrame/ReadFrame implement the framing (RFC 6455 §5): a small header (FIN bit, opcode, mask bit, 7/16/64-bit length), then payload. Two details matter and are tested (TestFrameRoundTripMasked):

Client→server frames MUST be masked (XOR with a 4-byte key) to defeat cache-poisoning of intermediaries; server→client frames are not. Getting the mask direction wrong is the #1 hand-rolled-WebSocket bug.
Control frames (Close, Ping, Pong) are handled inline. Conn.ReadMessage auto-answers a ping with a pong (TestPingAutoPong) — that's your application-layer keepalive, holding NAT mappings open and detecting dead peers (the gw-01 keepalive idea at L7).

A net.Pipe lesson baked into the tests: the ping/pong test uses real loopback TCP, not net.Pipe, because net.Pipe is fully unbuffered — a server writing a pong while the client is still writing the next frame deadlocks. That's not a toy concern: any duplex protocol over an unbuffered transport can deadlock if you don't separate read and write paths. (In production each connection has a dedicated write pump for exactly this reason — see §2.)

2. Holding connections (node.go)

Node holds deviceID -> Sink. A Sink is a per-device outbound channel; Send returns false if the device's bounded queue is full rather than blocking. TestSlowConsumerIsolation proves the property that keeps a node alive: a slow device whose queue fills gets its message dropped, while every other device keeps receiving. One slow consumer must never head-of-line-block the whole node: at 200k connections per node, a single blocking write would stall thousands of devices.

In the real server (cmd/pushyd), each connection gets exactly one writer goroutine (the "write pump") behind a mutex, because concurrent writes to a WebSocket corrupt framing. The bounded queue + single writer + drop-on-full combination is the canonical high-density connection model.

Connection density is the whole cost model. Node.Count() is the metric that matters; the talk's 60k → 200k → 400k journey was a profiling exercise in shrinking per-connection memory (buffers, TLS state), fds, and heartbeat cost. Do the arithmetic: 200k conns × ~10 KB state ≈ 2 GB of connection state per node before any payload.

3. The push registry (registry.go)

To deliver to device D you don't search every node — you look up registry[D] = node. MemRegistry is the in-process stand-in for Netflix's KeyValue store; the interface is what matters.

The load-bearing detail is TTLs. TestRegistryTTL (with an injectable clock) shows an entry expiring: a node that crashes never runs its clean Unregister, so without a TTL its entries would point at a dead node forever, and pushes would silently vanish. Expired entries are treated as misses — that's your crashed-node cleanup. Unregister also only removes an entry if this node still owns it, so it can't clobber a device that already reconnected elsewhere (a real race during drain). Registry staleness is a delivery-reliability property, not a caching nicety.

4. Async delivery (node.go)

A backend doesn't call Pushy directly — it Publishes a PushEvent to a topic (Kafka in production, a channel here) and returns immediately. The MessageProcessor consumes the topic, looks up the registry, and forwards to the owning node. TestDeliveryRouting runs the whole path: connect device-42 to node n2, publish to device-42, and assert it arrives on n2's sink with Delivered == 1 — routed purely by registry lookup, with zero coupling between the publisher and the holding node.

TestRegistryMissPolicy covers the other branch: a push to an unconnected device increments Misses and is dropped (or, per policy, queued for redelivery on reconnect). The processor also tracks ForwardFails for the race where a node drains between the lookup and the forward — the device will reconnect elsewhere and the message is redelivered (§5).

Why decouple with a queue? A redeploying or slow Pushy node must not block publishers; the queue is a mailbox, not a phone call. The cost is delivery latency and per-device ordering (partition Kafka by deviceId if order matters).
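Partitioning by deviceId amounts to a stable hash of the key. The sketch below uses FNV-1a from the standard library purely for illustration; `partitionFor` is a hypothetical helper, and Kafka's own default partitioner uses a different hash, so the partition numbers will not match a real cluster.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor maps a device to one of n partitions (n must be > 0). The same
// device always lands on the same partition, so a single consumer sees that
// device's events in publish order.
func partitionFor(deviceID string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(deviceID)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(n))
}

func main() {
	for _, d := range []string{"device-41", "device-42", "device-43"} {
		fmt.Printf("%s -> partition %d\n", d, partitionFor(d, 8))
	}
}
```

Ordering holds only per partition: two devices on different partitions have no ordering relationship, which is exactly the guarantee to state.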

5. The hardest part: graceful drain (drain.go)

You cannot SIGKILL a node holding 200k live connections. Drain means: stop accepting new connections, tell connected devices to reconnect, and let them migrate over minutes — without a stampede.

The stampede is the trap. If every device reconnects the instant it's told, they all hit the next node simultaneously, overflow its accept queue (gw-01), fail, retry, and you've turned a deploy into a self-inflicted DDoS. The fix is full jitter: each device waits a uniformly random delay over the reconnect window. Jitter.ReconnectDelays assigns those delays; TestReconnectJitterSpread proves that across 1000 devices and a 30 s window, every time-bucket is non-empty — a smooth ramp, not a spike. SpreadBuckets is the histogram that makes it visible.

Backoff adds exponential backoff with full jitter for unrequested disconnects (TestBackoffWithinCap). Note the distinction the CONCEPTS file draws: backoff spaces out one client's successive retries; jitter de-correlates many clients from each other. You need both.

Finally, drain forces at-least-once + idempotent delivery: a message in flight when a device reconnects may be redelivered. Dedup (TestAtLeastOnceDedup) tracks recently-seen MsgIDs with bounded memory so the client suppresses duplicates — which is how you reach "five nines" honestly (at-least-once delivery + idempotent client = effectively-once for the user; true exactly-once is a myth).
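One way to get bounded-memory dedup is a fixed-size FIFO of recently seen ids. The sketch below is illustrative only: `dedup` and `firstTime` are hypothetical names, ids are assumed non-empty, and the type is not safe for concurrent use without a lock.

```go
package main

import "fmt"

// dedup remembers the last `capacity` message ids and nothing older.
type dedup struct {
	seen  map[string]struct{}
	order []string // ring buffer of remembered ids, oldest overwritten first
	next  int
}

// newDedup requires capacity > 0.
func newDedup(capacity int) *dedup {
	return &dedup{
		seen:  make(map[string]struct{}, capacity),
		order: make([]string, capacity),
	}
}

// firstTime records id and reports whether it was NOT among the remembered
// ids. A false result means "duplicate, suppress it".
func (d *dedup) firstTime(id string) bool {
	if _, dup := d.seen[id]; dup {
		return false
	}
	if old := d.order[d.next]; old != "" {
		delete(d.seen, old) // evict the oldest id to stay within capacity
	}
	d.order[d.next] = id
	d.seen[id] = struct{}{}
	d.next = (d.next + 1) % len(d.order)
	return true
}

func main() {
	d := newDedup(1024)
	fmt.Println(d.firstTime("msg-1")) // first delivery
	fmt.Println(d.firstTime("msg-1")) // redelivery after a reconnect
}
```

The capacity is the trade-off to state: a duplicate that arrives after its id has been evicted is delivered twice, so the window must cover the longest plausible redelivery gap.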

6. Hands-on

cd src/go
bash ../scripts/verify.sh        # tests -race

go build -o /tmp/pushyd ./cmd/pushyd
/tmp/pushyd -ws :8090 -admin :8091 &

# Connect a device (first text message = deviceId). With websocat:
#   echo device-42 | websocat ws://127.0.0.1:8090/ws -
curl localhost:8091/stats                                  # connected devices
curl -X POST 'localhost:8091/deliver?device=device-42' -d 'new episode!'

If you don't have websocat, write a 20-line Go client using ClientHandshake + WriteMessage — the package exports both sides.

7. Exercises

Write the drain loop end to end: on SIGTERM, Node sends each device a JSON {"type":"reconnect","afterMs":N} using ReconnectDelays, waits for Count() to fall (bounded by a deadline), then exits. Simulate a fleet and plot reconnect arrivals/s with and without jitter (the spike vs the flat ramp).
Add a write pump: give each connection a bounded outbound channel and a single writer goroutine; show that a kill -STOP'd client gets dropped without affecting others (the §2 property, over real sockets).
Make the registry distributed: back Registry with a real KV (Redis) and add a reconciliation job that sweeps orphaned entries from crashed nodes. Measure lookup latency on the delivery path.
Partition for ordering: route PushEvents through N topic partitions keyed by deviceId and prove per-device ordering survives concurrent processors.
Density profiling: connect 50k fake devices to one Node and measure RSS per connection; find the biggest per-connection cost and shrink it (the 60k→200k journey in miniature).

gw-05 — References

The primary sources (named in the JD)

Pushy to the Limit: Evolving Netflix's WebSocket proxy for the future — TechBlog, 2024. Hundreds of millions of connections; registry on KeyValue (was Dynomite); Kafka message processor; 60k→200k→400k conns/node; 99.999% delivery. https://netflixtechblog.com/pushy-to-the-limit-evolving-netflixs-websocket-proxy-for-the-future-b468bc0ff658
Scaling Push Messaging for Millions of Devices @Netflix — InfoQ / re:Invent 2018. The original Zuul Push / Pushy architecture. https://www.infoq.com/news/2018/07/zuul-push-messaging/
Push Messaging — Netflix/zuul wiki (the open-source Zuul Push building blocks). https://github.com/Netflix/zuul/wiki/Push-Messaging

Protocol

RFC 6455 — The WebSocket Protocol. Read §1.3 (handshake), §5 (framing + masking), §5.5 (control frames: close/ping/pong).
RFC 8441 — Bootstrapping WebSockets with HTTP/2 (CONNECT + :protocol). The h2 path for WebSockets.
WebTransport (over HTTP/3) — the QUIC-based future (gw-02): survives network changes via connection migration. https://www.w3.org/TR/webtransport/
Server-Sent Events (WHATWG HTML EventSource) — the simpler server→client-only alternative.

Implementations

nhooyr.io/websocket and github.com/gorilla/websocket (Go): two common Go WebSocket libraries. The code in the steps imports nhooyr.io/websocket; the src/go/pushy package itself is stdlib-only.
Netty WebSocketServerProtocolHandler — the Java/Netty path (the Pushy lineage).
k6 (with the websockets module) or Gatling — load generators that can open many concurrent WebSocket connections.

Scaling & ops

The C10M problem / "millions of connections per box" writeups (WhatsApp's 2M-connections-per-server post is the classic).
Linux tuning for many connections: fs.file-max, ulimit -n, net.ipv4.ip_local_port_range, net.core.somaxconn, ephemeral-port and conntrack budgets (gw-01).
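Marc Brooker: Timeouts, retries, and backoff with jitter (Amazon Builders' Library; the reconnect-storm fix). His earlier AWS Architecture Blog post, Exponential Backoff And Jitter, compares the jitter strategies. https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/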

Cross-lab dependencies

Upstream: gw-01 (drain/keepalive), gw-02 (upgrade path), gw-03 (event-loop discipline), db-20 (the registry is a distributed KV).
Downstream: gw-06 (reconnect backoff/jitter, load shed), gw-09 (readiness + termination grace for drain), gw-11 (delivery-reliability SLOs).

gw-05 — Analysis

The design review for the miniature Pushy you build in steps/.

Required behaviors

Hold and authenticate connections. Each WebSocket is authenticated on connect (gw-07), registered, and kept alive with ping/pong; a dead peer is detected and cleaned up.
Route by registry. A message addressed to device D is delivered to the node currently holding D, looked up in the registry; a registry miss follows an explicit policy (drop / queue / error).
Async, decoupled delivery. Publishers enqueue (Kafka in prod; a channel/topic in the lab) and return immediately; delivery happens off that queue so a slow node never blocks a publisher.
Graceful drain. On shutdown: stop new connections, signal reconnect, drain with backoff+jitter over a long window, keep the registry consistent as devices migrate.
Bounded per-connection cost. Small buffers, bounded write queues per connection, idle/keepalive timeouts — density is the constraint.

Design decisions

Registry as an interface. Register/Lookup/Unregister/TTL behind an interface, backed by an in-memory map in the lab but shaped like a KV service (KeyValue/Redis) so the production swap is obvious. TTLs are first-class: a crashed node's entries must expire.
Per-connection bounded write queue. Each connection has a small buffered channel for outbound messages; if it fills (slow client), you drop-oldest or disconnect per policy — never block the delivery path on one slow consumer (head-of-line across devices).
Reconnect with backoff + full jitter. The drain path sends a "reconnect" control message; clients wait random(0, min(cap, base·2^attempt)). That is full jitter, the Amazon Builders' Library recommendation, chosen precisely to avoid the reconnect storm.
Delivery is at-least-once + idempotent. Messages carry an id; clients dedupe. The lab demonstrates a redelivery on reconnect and the client suppressing the duplicate, which is how you reach the "five nines" claim honestly (not exactly-once).

Tradeoffs worth flagging

Density vs resilience headroom. Packing 200k–400k connections per node maximizes efficiency but raises blast radius: losing one node forces 200k reconnects. Size nodes and drain windows so a node loss is absorbable by the rest of the fleet without a cascading storm.
Registry consistency vs latency. A strongly consistent registry costs latency on every connect/disconnect (high-churn operations); an eventually consistent one risks misdelivery. The pragmatic answer is a fast KV with short TTLs + update-on-event + a miss policy, accepting a small misdelivery rate covered by retries.
Sticky load that scale-out can't fix. New nodes only attract new connections. The only way to cool a hot node is to cycle its connections, which costs a controlled reconnect wave. Capacity planning must account for connection lifetime, not just connection rate.
Kafka decoupling adds delivery latency and an ordering question. The mailbox model means a small delay and per-key ordering concerns (does device D need messages in order? then partition by D). State the ordering guarantee you're providing.

What production adds beyond this lab

TLS termination + device identity (gw-07) on every connection at fleet scale, with cert rotation.
A real distributed registry (KeyValue/Redis) with replication, TTLs, and a reconciliation job for orphaned entries (crashed nodes).
Kafka partitioning by device for ordered delivery, consumer-group scaling, and backpressure when delivery falls behind.
Connection-density tuning: instance selection, fd limits, event-loop sizing, GC/allocation profiling, heartbeat-interval tuning — the work behind the 60k→200k→400k jump.
Drain orchestration integrated with Kubernetes (long termination grace, preStop reconnect signal, PodDisruptionBudgets) — gw-09, gw-12.
Delivery-reliability observability: per-message acks, delivery latency, registry hit/miss, reconnect-storm detection (gw-11).

gw-05 — Execution

Prerequisites

Go ≥ 1.25 (stdlib only, offline). Optional: websocat for the live server.

One-shot

cd gw-05-websockets-pushy && bash scripts/verify.sh   # → "=== gw-05 OK ==="

Per-language workflow (Go)

cd gw-05-websockets-pushy/src/go
go test -race -count=1 ./...        # 11 tests in package pushy
go build -o /tmp/pushyd ./cmd/pushyd

Run the node

/tmp/pushyd -ws :8090 -admin :8091

# connect a device (first text message is its id):
echo device-42 | websocat ws://127.0.0.1:8090/ws -
# from another shell:
curl localhost:8091/stats
curl -X POST 'localhost:8091/deliver?device=device-42' -d 'hello'

No websocat? Use pushy.ClientHandshake + WriteMessage in a tiny Go client (both sides of the protocol are exported).

Package map

File	What
`pushy/ws.go`	RFC 6455 handshake + framing (masking, ping/pong, close), `Conn`
`pushy/registry.go`	device→node registry with TTLs (crashed-node cleanup)
`pushy/node.go`	connection holding, slow-consumer isolation, message processor, publish
`pushy/drain.go`	full jitter, exponential backoff, reconnect-delay spreading, at-least-once dedup
`cmd/pushyd`	runnable WebSocket node + push admin API

See GUIDE.md for the deep dive (incl. the net.Pipe deadlock lesson and drain).

gw-05 — Verification

One command

cd gw-05-websockets-pushy && bash scripts/verify.sh

What the tests prove

Test	Invariant
`TestAcceptKey`	the RFC 6455 §1.3 accept-key derivation is exact
`TestFrameRoundTripMasked`	masked client→server frames encode/decode (unmask) correctly
`TestHandshakeAndMessage`	full upgrade handshake + a request/response message over a real connection
`TestPingAutoPong`	a ping is auto-answered with a pong on the wire (app-layer keepalive)
`TestRegistryTTL`	registry entries expire (crashed-node cleanup; expired = miss)
`TestDeliveryRouting`	a published event is routed to the owning node purely via registry lookup
`TestRegistryMissPolicy`	a push to an unconnected device is a counted miss, not a crash
`TestSlowConsumerIsolation`	a full per-device queue drops (doesn't block); other devices unaffected
`TestReconnectJitterSpread`	full-jitter reconnect delays spread across all buckets (no stampede)
`TestAtLeastOnceDedup`	a redelivered MsgID is suppressed (idempotent client)
`TestBackoffWithinCap`	exponential backoff with jitter stays within the cap

All under -race with a 60s timeout (the timeout guards against the net.Pipe-style deadlocks discussed in GUIDE §1).

What "green" does NOT guarantee

No TLS / device identity. Production terminates TLS and authenticates each connection (gw-07).
In-memory registry. A real registry is a replicated KV with a reconciliation sweep (GUIDE §7.3).
No full drain loop wired in. drain.go provides the primitives; wiring SIGTERM → reconnect signals → bounded wait is an exercise (GUIDE §7.1) and needs Kubernetes' long terminationGracePeriodSeconds (gw-09).
Density not measured here. The 60k→200k journey is a profiling exercise (GUIDE §7.5).

gw-05 step 01 — A WebSocket server and a push registry

Goal

Stand up a WebSocket server that accepts device connections, registers each deviceId → node in a push registry, and keeps connections alive with ping/pong. This is the connection-holding half of Pushy.

Code — `src/go/node.go`

package pushy

import (
	"context"
	"sync"
	"time"

	"nhooyr.io/websocket" // go get nhooyr.io/websocket
)

// Node is one Pushy instance: it holds live device connections and can
// write a message down any of them.
type Node struct {
	ID       string
	Registry Registry

	mu    sync.RWMutex
	conns map[string]*deviceConn // deviceId -> connection
}

type deviceConn struct {
	deviceID string
	ws       *websocket.Conn
	out      chan []byte // bounded per-connection write queue (backpressure)
}

func NewNode(id string, r Registry) *Node {
	return &Node{ID: id, Registry: r, conns: map[string]*deviceConn{}}
}

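// Handle runs for the life of one device connection.
func (n *Node) Handle(ctx context.Context, ws *websocket.Conn, deviceID string) {
	// Cancel the pumps' context on return so writePump exits as soon as
	// readPump does, instead of lingering until its next failed write.
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	dc := &deviceConn{deviceID: deviceID, ws: ws, out: make(chan []byte, 16)}

	n.mu.Lock()
	n.conns[deviceID] = dc
	n.mu.Unlock()
	n.Registry.Register(deviceID, n.ID, 60*time.Second) // device D lives on this node

	defer func() {
		n.mu.Lock()
		// Only remove our own entry: the device may already have
		// reconnected to this node and replaced it in the map.
		replaced := n.conns[deviceID] != dc
		if !replaced {
			delete(n.conns, deviceID)
		}
		n.mu.Unlock()
		if !replaced {
			n.Registry.Unregister(deviceID, n.ID)
		}
		ws.Close(websocket.StatusNormalClosure, "bye")
	}()

	go n.writePump(ctx, dc) // drains dc.out to the socket
	n.readPump(ctx, dc)     // reads client frames + keeps registry TTL fresh
}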

// writePump is the ONLY goroutine that writes to the socket: it serializes
// writes and applies a per-write deadline.
func (n *Node) writePump(ctx context.Context, dc *deviceConn) {
	ping := time.NewTicker(20 * time.Second) // app-layer keepalive
	defer ping.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case msg := <-dc.out:
			wctx, cancel := context.WithTimeout(ctx, 5*time.Second)
			err := dc.ws.Write(wctx, websocket.MessageText, msg)
			cancel()
			if err != nil {
				return
			}
		case <-ping.C:
			pctx, cancel := context.WithTimeout(ctx, 5*time.Second)
			err := dc.ws.Ping(pctx) // waits for the pong or for pctx to expire
			cancel()
			if err != nil {
				return // dead peer detected
			}
			// A pong proves liveness, so refresh the TTL here too: Read
			// does not surface pongs, and an idle device sends nothing else.
			n.Registry.Touch(dc.deviceID, n.ID, 60*time.Second)
		}
	}
}

func (n *Node) readPump(ctx context.Context, dc *deviceConn) {
	for {
		// Read returns data messages only; the library handles ping/pong
		// control frames internally while a Read is in progress.
		_, _, err := dc.ws.Read(ctx)
		if err != nil {
			return // connection closed
		}
		n.Registry.Touch(dc.deviceID, n.ID, 60*time.Second) // refresh TTL
	}
}

// Deliver enqueues a message for a locally-connected device. Returns
// false if the device isn't on this node (caller should consult registry).
func (n *Node) Deliver(deviceID string, msg []byte) bool {
	n.mu.RLock()
	dc, ok := n.conns[deviceID]
	n.mu.RUnlock()
	if !ok {
		return false
	}
	select {
	case dc.out <- msg:
		return true
	default:
		// Write queue full (slow client): drop or disconnect per policy.
		// Never block the delivery path on one slow device.
		return false
	}
}

Code — the registry interface

package pushy

import (
	"sync"
	"time"
)

// Registry maps deviceId -> owning node, with TTLs so a crashed node's
// entries expire. Backed by a map here; KeyValue/Redis in production.
type Registry interface {
	Register(deviceID, node string, ttl time.Duration)
	Touch(deviceID, node string, ttl time.Duration)
	Unregister(deviceID, node string)
	Lookup(deviceID string) (node string, ok bool)
}

type MemRegistry struct {
	mu sync.RWMutex
	m  map[string]entry
}
type entry struct {
	node    string
	expires time.Time
}

func NewMemRegistry() *MemRegistry { return &MemRegistry{m: map[string]entry{}} }

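func (r *MemRegistry) Register(d, n string, ttl time.Duration) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.m[d] = entry{node: n, expires: time.Now().Add(ttl)}
}

// Touch refreshes the TTL only if n still owns d, so a late refresh from
// the old node cannot overwrite a device that has reconnected elsewhere.
func (r *MemRegistry) Touch(d, n string, ttl time.Duration) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if e, ok := r.m[d]; ok && e.node == n {
		r.m[d] = entry{node: n, expires: time.Now().Add(ttl)}
	}
}

// Unregister removes d only if n still owns it (same race as Touch).
func (r *MemRegistry) Unregister(d, n string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if e, ok := r.m[d]; ok && e.node == n {
		delete(r.m, d)
	}
}

func (r *MemRegistry) Lookup(d string) (string, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	e, ok := r.m[d]
	if !ok || time.Now().After(e.expires) {
		return "", false // expired entries are misses (crashed-node cleanup)
	}
	return e.node, true
}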

Tasks

Wire Node.Handle behind an http.HandlerFunc that accepts the upgrade (websocket.Accept) and reads a deviceId query param.
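Connect a few clients (a k6 script or a tiny Go client). Confirm the registry shows each deviceId → nodeID and that an entry is removed when its client disconnects cleanly.
Kill a client uncleanly. Note that SIGKILL alone makes the client's kernel close the socket, so the server's Read fails immediately. To exercise ping-based detection, freeze the client (kill -STOP) or drop its packets, then confirm the failed ping detects the dead peer within about one keepalive interval plus the ping timeout (20 s + 5 s in the code above) and the registry entry is removed/expires.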

Acceptance

Multiple devices stay connected; registry reflects ownership with TTLs.
A dead peer is detected via failed ping and cleaned up.
A slow consumer (full out queue) does not block delivery to other devices (you can prove this in step 02).

Discussion prompts

Why is there exactly one writer goroutine per connection (the writePump)? (Concurrent writes to a WebSocket corrupt framing.)
Why must registry entries have a TTL even though you Unregister on clean close? (Crashes/network partitions never run your defer.)
At 200k connections/node, how big can out be? Do the memory math for the buffer size you chose.

gw-05 step 02 — Async delivery via a message queue

Goal

Add the push half: a backend publishes "send msg to device D" to a queue; a message processor consumes it, looks up the registry, and forwards to the owning node, which writes it down the socket. In production the queue is Kafka and the processor is a Spring Boot service; the lab uses a Go channel topic so the architecture is legible.

Code — the message processor

package pushy

import (
	"context"
	"encoding/json"
	"log"
)

// PushEvent is what a backend publishes: "deliver Payload to DeviceID".
type PushEvent struct {
	DeviceID string `json:"deviceId"`
	Payload  []byte `json:"payload"`
	MsgID    string `json:"msgId"` // for at-least-once dedupe (step 03)
}

// NodeClient forwards a message to a (possibly remote) node that holds
// the device. In the lab it's an in-process map of nodes; in production
// it's an RPC to the owning Pushy instance.
type NodeClient interface {
	Forward(ctx context.Context, node string, ev PushEvent) error
}

// MessageProcessor consumes push events and routes them by registry.
type MessageProcessor struct {
	Registry Registry
	Nodes    NodeClient
}

// Consume reads events off the topic (Kafka in prod) and routes each.
// It returns when ctx is cancelled or the topic channel is closed.
func (mp *MessageProcessor) Consume(ctx context.Context, topic <-chan []byte) {
	for {
		select {
		case <-ctx.Done():
			return
		case raw, open := <-topic:
			if !open {
				// A closed channel yields nil forever; without this
				// check the loop would spin on unmarshal errors.
				return
			}
			var ev PushEvent
			if err := json.Unmarshal(raw, &ev); err != nil {
				log.Printf("skipping malformed event: %v", err)
				continue
			}
			node, ok := mp.Registry.Lookup(ev.DeviceID)
			if !ok {
				// Registry miss: device offline or moved. Policy: drop,
				// or push to an offline queue for redelivery on reconnect.
				log.Printf("miss: device %s not connected; dropping %s",
					ev.DeviceID, ev.MsgID)
				continue
			}
			if err := mp.Nodes.Forward(ctx, node, ev); err != nil {
				// Node gone between lookup and forward (it drained): the
				// device will reconnect elsewhere; redeliver or drop.
				log.Printf("forward to %s failed: %v", node, err)
			}
		}
	}
}

Code — publisher (what a backend calls)

// Publish is fire-and-forget from the backend's perspective: enqueue and
// return. The backend is NOT coupled to delivery latency or to a
// redeploying Pushy node; that's the whole point of the queue.
// It reports whether the event was enqueued, so a full topic is visible
// to the caller instead of being dropped silently.
func Publish(topic chan<- []byte, ev PushEvent) bool {
	raw, err := json.Marshal(ev)
	if err != nil {
		return false
	}
	select {
	case topic <- raw:
		return true
	default:
		// Topic full: in prod Kafka absorbs this; here, apply backpressure
		// or shed. Surfacing the bound is the lesson.
		return false
	}
}

Wire it up (lab, in-process)

reg := NewMemRegistry()
n1 := NewNode("n1", reg)
n2 := NewNode("n2", reg)
nodes := mapNodeClient{"n1": n1, "n2": n2} // Forward -> node.Deliver

topic := make(chan []byte, 1024)
mp := &MessageProcessor{Registry: reg, Nodes: nodes}
go mp.Consume(ctx, topic)

// A backend pushes to device D (connected to whichever node):
Publish(topic, PushEvent{DeviceID: "device-42", Payload: []byte("new episode!"), MsgID: "m1"})

Tasks

Implement the processor and a NodeClient that maps a node id to a local *Node and calls Deliver. Connect device-42 to n2, publish to device-42, and confirm it arrives over the socket on n2 — routed purely via the registry.
Registry miss: publish to a device that isn't connected; confirm the documented miss policy fires (drop + log, or offline-queue).
Slow consumer isolation: make one device's client stop reading. Show its out queue fills and it is dropped/disconnected, while delivery to other devices is unaffected (no head-of-line across devices).
Swap the channel topic for a real Kafka topic (segmentio/kafka-go) as a stretch; the processor code is unchanged — that's the point of the seam.

Acceptance

A message published by a "backend" reaches the correct device via registry lookup + forward, with zero coupling between publisher and the holding node.
Registry miss and slow-consumer cases follow explicit policies and don't stall the system.

Discussion prompts

Why decouple with a queue instead of the backend calling Pushy directly? (Publisher isn't blocked by a draining/slow node; natural buffering; fan-out.)
If device D needs messages in order, what must be true of the Kafka partitioning? (Partition by deviceId so one partition = one device's ordered stream.)
There's a race: registry says n2, but n2 drains before Forward. How do you make this safe? (Redeliver on forward failure + client reconnect + idempotency — step 03.)
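
The per-device ordering answer above rests on one property: the same deviceId always maps to the same partition. A minimal sketch of that mapping, assuming an FNV-1a hash and a hypothetical partitionFor helper (real Kafka clients apply their own hash to the message key, but the idea is the same):

```go
package main

import "hash/fnv"

// partitionFor maps a device id to one of `partitions` partitions.
// The same id always lands on the same partition, so one consumer sees
// that device's events in publish order. partitions must be > 0.
func partitionFor(deviceID string, partitions int) int {
	h := fnv.New32a()
	h.Write([]byte(deviceID)) // hash.Hash's Write never returns an error
	return int(h.Sum32() % uint32(partitions))
}
```

Ordering holds only while the partition count stays fixed; changing it remaps devices and can interleave old and new streams.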

gw-05 step 03 — Graceful drain and avoiding the reconnect storm

Goal

Deploy a node holding many live connections without dropping messages and without a reconnect stampede onto the rest of the fleet. This is the hardest operational problem in the phase and a guaranteed interview topic.

Code — drain with a reconnect signal

package pushy

import (
	"context"
	"encoding/json"
	"time"
)

// Drain stops new connections, asks every connected device to reconnect
// (elsewhere) after a JITTERED delay, and waits for connections to bleed
// off — up to a long deadline.
func (n *Node) Drain(ctx context.Context, deadline time.Duration) {
	// 1. (Caller already flipped readiness so the LB sends no NEW conns.)

	// 2. Tell every device to reconnect after a random delay so they
	//    don't all hit the next node at the same instant.
	n.mu.RLock()
	conns := make([]*deviceConn, 0, len(n.conns))
	for _, dc := range n.conns {
		conns = append(conns, dc)
	}
	n.mu.RUnlock()

	for _, dc := range conns {
		// Full jitter: each client waits random(0, reconnectWindow).
		delayMs := jitterMillis(30_000) // spread reconnects over 30s
		msg, _ := json.Marshal(map[string]any{
			"type": "reconnect", "afterMs": delayMs,
		})
		dc.enqueue(msg) // client schedules reconnect after delayMs + its own backoff
	}

	// 3. Wait for connections to drain, bounded by the deadline or by ctx.
	end := time.Now().Add(deadline)
	for time.Now().Before(end) && ctx.Err() == nil {
		n.mu.RLock()
		remaining := len(n.conns)
		n.mu.RUnlock()
		if remaining == 0 {
			return
		}
		time.Sleep(500 * time.Millisecond)
	}
	// 4. Deadline hit or ctx cancelled: force-close stragglers (they
	//    reconnect via backoff).
	n.closeAll()
}

Code — full jitter (the storm-prevention primitive)

import (
	"math/rand"
	"time"
)

// jitterMillis returns a uniformly random delay in [0, windowMs): "full
// jitter" as described in the AWS Builders' Library. This is what turns
// a synchronized stampede into a smooth ramp. windowMs must be > 0.
func jitterMillis(windowMs int) int { return rand.Intn(windowMs) }

// Client-side reconnect backoff (for reference): exponential with full
// jitter, capped. Used after an UNREQUESTED disconnect too.
func reconnectDelay(attempt int) time.Duration {
	const base, maxDelay = 500 * time.Millisecond, 30 * time.Second
	if attempt < 0 {
		attempt = 0
	}
	// base<<6 = 32s already exceeds maxDelay, so only shift for small
	// attempts. An unbounded shift overflows int64 and makes Int63n panic.
	d := maxDelay
	if attempt < 6 {
		d = base << attempt
	}
	return time.Duration(rand.Int63n(int64(d))) // full jitter
}

The experiment — see the storm, then prevent it

Simulate a fleet (a few nodes, N simulated devices) and a metric of "reconnects per second arriving at the rest of the fleet" during a drain.

A) Drain with NO jitter (all devices reconnect immediately):
   reconnect arrivals spike to ~N in one tick → the next node's accept
   queue overflows (gw-01) → some reconnects fail → clients retry →
   the spike echoes. A self-inflicted DDoS.

B) Drain WITH full jitter over a 30s window:
   reconnect arrivals are a flat ~N/30 per second → every node absorbs
   its share → no accept-queue overflow → smooth migration.

Plot both. The difference is the entire lesson.

At-least-once + idempotency on reconnect

When a device reconnects mid-drain, an in-flight message may be redelivered (the registry pointed at the old node when it was published). Make delivery safe:

- every PushEvent carries MsgID
- the client tracks recently-seen MsgIDs and suppresses duplicates
- the server may redeliver freely (at-least-once) — dedupe makes it
  effectively-once from the user's view
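
A minimal sketch of the client-side dedupe described above, assuming a fixed-size memory of recent ids. Deduper, NewDeduper, and FirstTime are hypothetical names; the capacity is a tuning choice, and the type is not safe for concurrent use without a lock.

```go
package main

// Deduper remembers the most recent `capacity` message ids and reports
// whether an id is new. Older ids are forgotten, oldest first.
type Deduper struct {
	seen  map[string]struct{}
	order []string // ring buffer of remembered ids
	next  int      // slot that will be overwritten next
}

// NewDeduper returns a Deduper that remembers capacity ids (capacity > 0).
func NewDeduper(capacity int) *Deduper {
	return &Deduper{
		seen:  make(map[string]struct{}, capacity),
		order: make([]string, capacity),
	}
}

// FirstTime returns true the first time msgID is seen (and records it),
// false for a duplicate that is still remembered.
func (d *Deduper) FirstTime(msgID string) bool {
	if _, dup := d.seen[msgID]; dup {
		return false
	}
	if len(d.seen) == len(d.order) {
		delete(d.seen, d.order[d.next]) // full: evict the oldest id
	}
	d.order[d.next] = msgID
	d.seen[msgID] = struct{}{}
	d.next = (d.next + 1) % len(d.order)
	return true
}
```

The memory must be larger than the number of messages that can be redelivered during one drain, otherwise a late duplicate slips through.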

Tasks

Implement Drain with full jitter and a long deadline; a simulated client that honors the reconnect message (waits afterMs, then reconnects to another node with its own backoff+jitter).
Run experiment A vs B; plot reconnect arrivals/sec at the rest of the fleet. Capture the spike vs the flat ramp.
Show a redelivered message during drain and the client suppressing the duplicate via MsgID — demonstrating at-least-once + idempotency.
Tie drain to Kubernetes: set a long terminationGracePeriodSeconds and a preStop hook that triggers Drain; explain why a default 30s grace period would force-close 200k connections (gw-09).

Acceptance

A drain that migrates all devices off a node with no message loss and no reconnect spike (flat arrivals with jitter; a clear spike without it).
A demonstrated duplicate suppressed by MsgID.
A correct statement of the Kubernetes grace-period requirement for a high-density node.

Discussion prompts

Why "full jitter" rather than "exponential backoff" alone? (Backoff spreads retries of one client; jitter de-synchronizes many clients. You need both.)
Draining 200k connections over 30s = ~6.7k reconnects/sec leaving this node. Is that absorbable by a fleet of M nodes? Do the math and decide the right reconnect window.
Scale-out adds nodes but existing connections don't move. So how do you actually cool a hot node — and why is that the same machinery as drain?
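
The second prompt's arithmetic can be written down as a pair of helpers. This is a sketch assuming the load balancer spreads reconnects evenly over the surviving nodes; reconnectRatePerNode and windowFor are hypothetical names, and the fleet size and per-node accept budget in the usage note are made-up numbers.

```go
package main

import "math"

// reconnectRatePerNode returns reconnects/sec landing on each surviving
// node when one node drains conns connections uniformly over windowSec
// seconds into a fleet of fleetSize nodes (the draining node excluded).
func reconnectRatePerNode(conns, windowSec, fleetSize int) float64 {
	return float64(conns) / float64(windowSec) / float64(fleetSize-1)
}

// windowFor returns the smallest whole-second window that keeps every
// surviving node at or below maxPerNode reconnects/sec.
func windowFor(conns, fleetSize int, maxPerNode float64) int {
	w := float64(conns) / (maxPerNode * float64(fleetSize-1))
	return int(math.Ceil(w))
}
```

For example, 200k connections over 30s into an 11-node fleet is about 667 reconnects/sec per surviving node; if each node can only absorb 500/sec, the window has to grow to 40s.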

gw-06 — Resilience: Load Balancing, Retries, Circuit Breaking, Adaptive Concurrency

The JD lists "uplift the resilience posture for traffic traversing the cloud" as a primary responsibility. Resilience is the set of data-plane behaviors that keep the gateway (and the services behind it) serving during the inevitable: a slow origin, a dead instance, a traffic spike, a dependency brown-out. Done well, the gateway absorbs trouble. Done poorly, the gateway amplifies it — a retry storm turns a blip into an outage, a missing concurrency limit turns a slow dependency into a full collapse.

This lab covers the resilience toolkit a Cloud Gateway engineer must own: client-side load balancing (round-robin, power-of-two-choices, least-loaded, zone-aware), outlier ejection, timeouts, retries with budgets and hedging, circuit breakers, adaptive concurrency limits, and load shedding. The unifying theme is the one Netflix and Google learned the hard way: the failure modes that take you down are usually self-inflicted amplification.

You will implement power-of-two-choices load balancing and an adaptive concurrency limiter, and reproduce a retry storm to see amplification (and a budget stop it).

1. What is it?

Resilience patterns split into choosing where to send a request (load balancing + outlier ejection), deciding whether and how to send it again (timeouts, retries, hedging), and protecting yourself and your dependencies from overload (circuit breakers, concurrency limits, load shedding).

request ─▶ [pick endpoint: LB over healthy subset]
        ─▶ [send with a timeout]
        ─▶ on failure: retry? (budget + idempotency) / hedge?
        ─▶ if endpoint is bad: eject it (outlier detection)
        ─▶ if dependency is failing: open circuit (fail fast)
        ─▶ if WE are overloaded: shed/limit at admission (adaptive concurrency)

The deepest of these — and the most interview-worthy — is adaptive concurrency: instead of a fixed thread/connection limit, the gateway measures latency and infers the right in-flight concurrency from Little's Law, shedding load at admission when the system is at its knee. It's a control loop, conceptually like TCP congestion control.

2. Why does it matter?

Amplification is how outages happen. With a 3× retry policy and no budget, a 1% origin error rate adds up to 3% extra load, which sounds harmless. But the extra load scales with the error rate, so as the origin struggles and errors climb toward 100%, offered load approaches 4× baseline. That pushes the origin further over, causing more errors and more retries: a metastable failure that doesn't recover even after the trigger clears. The single most important resilience insight is "don't amplify," and the JD's "resilience posture" is largely about not building amplifiers.
The gateway is the natural place for cross-cutting resilience. Putting retries/circuit-breaking/limits at the edge means every service gets them without re-implementing, and the gateway can make fleet-wide decisions (shed the least important traffic first).
Tail latency dominates user experience. p99/p99.9, not p50, is what users feel, and it's where hedging, outlier ejection, and good LB pay off. A single slow instance in a naive round-robin set drags the tail; P2C and outlier ejection route around it.
It's where data-driven root-cause lives. The JD's emphasis on "identify root causes using data" maps directly here: distinguishing "the dependency is slow" from "we're amplifying with retries" from "we're concurrency-limited" requires the right metrics and the discipline to read them (gw-11).
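
The retry arithmetic in the first bullet can be made concrete. The sketch below models the worst case, where a fraction of requests keeps failing and every retry of those requests fails too; offeredLoad and budgetedLoad are hypothetical names, and the 10% budget in the usage note is only an example value.

```go
package main

// offeredLoad returns the load the origin sees, as a multiple of base
// traffic, when a fraction failFrac of requests keeps failing and each
// is retried `retries` times with no budget.
func offeredLoad(failFrac float64, retries int) float64 {
	return 1 + failFrac*float64(retries)
}

// budgetedLoad is the same worst case with total retries capped at
// `budget`, expressed as a fraction of base traffic.
func budgetedLoad(failFrac float64, retries int, budget float64) float64 {
	extra := failFrac * float64(retries)
	if extra > budget {
		extra = budget
	}
	return 1 + extra
}
```

At a 1% failure rate both policies add at most 3% load. At a 100% failure rate the unbudgeted policy offers 4× base load, while a 10% budget holds it at 1.1×.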

3. How does it work?

Load balancing: round-robin → P2C → least-loaded

Round-robin spreads evenly but is blind to load — it keeps sending to a slow/overloaded instance.
Least-connections / least-loaded picks the instance with the fewest in-flight requests — load-aware, but requires global state (which instance has how many) that's expensive/stale at scale.
Power of Two Choices (P2C) is the sweet spot: pick two instances at random, send to the less loaded of the two. No global coordination, yet provably avoids the worst hot spots ("the power of two random choices" result: exponential reduction in max load vs one random choice). Finagle defaults to it, and Envoy implements it as its LEAST_REQUEST policy (Envoy's out-of-the-box default is round-robin).

P2C:  a, b = two random endpoints
      pick whichever has fewer in-flight (or lower EWMA latency)

Zone/locality-aware LB prefers same-zone endpoints (lower latency, no cross-AZ data cost) while keeping a fallback to other zones — the edge cares about this for both latency and cloud-egress cost.

Outlier ejection

Continuously watch per-endpoint error/latency. When one crosses a threshold (e.g. consecutive 5xx, or latency >> the set's median), eject it from rotation for a cool-down, then probe it back. This is passive health checking from real traffic — it reacts faster than periodic active health checks and catches "the instance is up but broken."

Timeouts, retries, budgets, hedging

Timeout — the foundational primitive (gw-03). Every call is bounded. A retry without a tight timeout is useless (you wait forever, then wait again).
Retry — re-send on a retryable failure (connection refused, 503, timeout) — but only for idempotent requests, and never retry a request that the origin may have already processed unless it's safe.
Retry budget — cap retries to a fraction (e.g. 10%) of total requests, fleet-wide, so retries can never become a large multiple of base load. This is the anti-amplification mechanism; prefer it to fixed per-request retry counts.
Hedging — for latency (not errors): if a request hasn't responded by the p95, send a second copy to another endpoint and take the first to answer. Cuts the tail dramatically; costs extra load, so it's also budgeted.

Circuit breaker

When a dependency is failing, stop calling it for a while — fail fast instead of piling up doomed requests (which exhausts your concurrency and slows everything). The classic state machine:

CLOSED ── error rate exceeds threshold ──▶ OPEN ── cool-down elapses ──▶ HALF-OPEN
   ▲                                                                        │
   └──────────────── probe succeeds ──────────────────────────────────────┘
          (probe fails in HALF-OPEN ──▶ back to OPEN)

OPEN = reject immediately (or serve a fallback). HALF-OPEN = let a trickle through to test recovery. This bounds the damage a sick dependency can do to the gateway itself.

Adaptive concurrency (the deep one)

A fixed concurrency limit is always wrong: too low wastes capacity, too high lets a slow dependency build an unbounded queue (latency explodes). Adaptive concurrency measures and adjusts. The theory is Little's Law:

L = λ × W        (in-flight = arrival_rate × latency)

When the system is healthy, increasing concurrency raises throughput with flat latency. Past the knee, latency rises but throughput doesn't — you're queueing. An AIMD-style controller (like TCP congestion control, or the Gradient algorithm Netflix open-sourced in concurrency-limits) tracks the minimum observed latency (RTT_noload) and the current latency, and shrinks the limit when current >> minimum (a queue is forming), grows it when they're close. Requests over the limit are shed at admission (fast 503) rather than queued — bounded latency under overload.

Load shedding & prioritization

When you must drop, drop the least important traffic first (criticality tiers): shed retries before originals, background before interactive, non-paying before paying. The gateway is the right place to encode this because it sees all traffic.
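
One simple way to encode criticality tiers is to give each tier its own utilization ceiling, so lower tiers are refused first as the gateway fills up. This is a sketch under assumptions: the three tiers, their relative order (retries shed before background), and the 70%/85% thresholds are illustrative choices, and admit is a hypothetical helper.

```go
package main

// Priority tiers, most important first.
type Priority int

const (
	Interactive Priority = iota
	Background
	Retry
)

// admit decides whether a request of priority p may enter when inflight
// requests are outstanding against a concurrency limit. Lower tiers are
// refused at lower utilization, so they are shed first.
func admit(p Priority, inflight, limit int) bool {
	if limit <= 0 {
		return false
	}
	util := float64(inflight) / float64(limit)
	switch p {
	case Retry:
		return util < 0.70 // retries go first
	case Background:
		return util < 0.85 // then background/prefetch
	default:
		return util < 1.0 // interactive is shed only at the hard limit
	}
}
```

At 75% utilization this refuses retries but still admits background and interactive traffic; at 90% only interactive traffic gets in.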

4. Core terminology

Term	Definition
Round-robin / least-loaded	Blind even spread / load-aware pick (needs global state).
Power of Two Choices (P2C)	Pick 2 random endpoints, send to the less-loaded; near-optimal, no coordination.
EWMA	Exponentially-weighted moving average (of latency) used to rank endpoints.
Outlier ejection	Removing a bad endpoint from rotation based on live error/latency.
Zone-aware LB	Prefer same-AZ endpoints for latency and egress cost, with fallback.
Retry budget	A cap on retries as a fraction of total traffic; the anti-amplification rule.
Hedging	Sending a duplicate request after a delay to cut tail latency.
Idempotency	A request safe to send more than once; required for safe retries.
Circuit breaker	CLOSED/OPEN/HALF-OPEN state machine that fails fast on a sick dependency.
Adaptive concurrency	Dynamically inferring the right in-flight limit from latency (Little's Law).
Little's Law	`L = λ × W`: in-flight = arrival rate × time-in-system.
Load shedding	Rejecting (the least important) requests at admission under overload.
Metastable failure	A self-sustaining overload that persists after the trigger clears.

5. Mental models

Retries without a budget are a payday loan. They feel like help in the moment and bury you exactly when you're weakest. A budget caps the interest: retries can add at most X% load, never a multiple.
A circuit breaker is a fuse, not a switch you flip. It trips automatically to protect the house when current spikes, and resets cautiously (half-open). Without it, a shorted appliance (sick dependency) burns down the house (exhausts the gateway).
Adaptive concurrency is cruise control on a hill. A fixed throttle (fixed limit) either crawls on the flat or redlines on the climb. Cruise control measures speed (latency) and adjusts the throttle (limit) to hold the target — backing off the instant the engine strains (latency rises above the no-load minimum).
P2C is hiring with two résumés instead of one or all. Reading every résumé (global least-loaded) is too slow at scale; picking one blindly (random/round-robin) sometimes hires the worst. Comparing just two random candidates avoids the worst hire almost as well as reading them all — for almost no cost.
Load shedding is triage, not failure. Under overload you will drop something; choosing to drop the least important traffic on purpose is the system staying in control, versus dropping random (often the most important) traffic by collapsing.

6. Common misconceptions

"More retries = more reliable." The opposite past a point: retries amplify load during failures and cause metastable collapse. Reliability comes from budgeted retries + timeouts + circuit breaking + shedding, not from a bigger retry count.
"Round-robin is fine, instances are identical." They aren't, even when identical — GC pauses, noisy neighbors, a slow disk, an unlucky cache miss make instances momentarily slow. Load-aware LB (P2C) and outlier ejection route around transient slowness; round-robin walks right into it.
"A fixed thread/connection limit protects me." A fixed limit set for the healthy case becomes a queue-builder when the dependency slows (in-flight piles up to the limit, latency explodes). Adaptive limits shrink as latency rises; fixed limits don't.
"Hedging fixes errors." Hedging is for latency (tail), not errors — sending a second copy of a request that fails for a deterministic reason just doubles the failures. Use retries (budgeted) for errors, hedging for tail latency.
"The circuit breaker should open on the first error." Too sensitive and it flaps, cutting healthy traffic. Tune thresholds (error rate over a window, minimum request volume) so it trips on sustained failure, not noise.

7. Interview talking points

"How do you keep a slow dependency from taking down the gateway?" Timeout every call → circuit-break the dependency when it's sustainedly failing (fail fast, serve fallback) → adaptive concurrency limit so in-flight to it is bounded and excess is shed at admission → shed the least-important traffic first. The thread: bound the blast, don't queue the doomed.
"Walk me through a retry storm and how to prevent it." A blip → everyone retries → retries are the new load → origin stays over its knee → metastable. Prevent with: tight timeouts, retry budgets (cap retries to ~10% of traffic), exponential backoff + jitter, circuit breaking, and not retrying non-idempotent requests. Mention you'd verify with retries / requests ratio in metrics (gw-11).
"Why power-of-two-choices over round-robin or least-loaded?" RR is load-blind; global least-loaded needs expensive/stale coordination; P2C picks the better of two random endpoints and gets near-least-loaded behavior with O(1) local info — the "power of two random choices" result. Pair it with EWMA latency and outlier ejection.
"Explain adaptive concurrency limiting." Little's Law: in-flight = rate × latency. Track minimum (no-load) latency; when current latency climbs above it, a queue is forming, so shrink the limit (AIMD / gradient); when they're close, grow it. Shed over-limit requests at admission. It's congestion control for your service, and it needs no magic-number tuning. Cite Netflix's concurrency-limits.
"When is it safe to retry?" Idempotent requests, retryable failures (connection refused, 503, timeout where you know the origin didn't act), within budget, with backoff+jitter. Never blindly retry a POST that may have committed. gRPC: retry only on specific grpc-status codes (gw-02).
"You must drop traffic — what goes first?" Criticality tiers: shed retries before originals, background/prefetch before interactive, and protect the highest-value flows. The gateway encodes this because it sees everything. Tie to Netflix's "the show must go on" — degrade gracefully, never hard-fail the core experience.

8. Connections to other labs

gw-03 (API gateway) — all of this lives in the endpoint phase; the per-request timeout is the foundation everything else builds on.
gw-04 (connection management) — LB and outlier ejection operate within the subset; pool acquisition is itself a concurrency limit.
gw-05 (Pushy) — reconnect backoff+jitter and connect-path load shedding are this lab applied to persistent connections.
gw-08 (Envoy/xDS) — Envoy implements all of these (P2C, outlier detection, circuit breakers, adaptive concurrency) configured via xDS; the control plane tunes them per route/cluster.
gw-11 (observability) — you can't run any of this blind; the retry ratio, circuit state, ejection events, and concurrency limit are all metrics.
db-17 (Raft) — quorum/majority intuition shows up in "how big must a subset be to survive f failures" and in not amplifying during partitions.

gw-06 — The Hitchhiker's Guide to Data-Plane Resilience

Companion to CONCEPTS.md, with the runnable toolkit in src/go/resil/. The unifying lesson: the failures that take you down are usually self-inflicted amplification — and every primitive here is a way to refuse to amplify.

bash scripts/verify.sh runs the tests and the retry-storm simulator, which prints the whole thesis of this lab:

offered load per tick (capacity = 120; spike at ticks 10-15):
  tick:        0    4    8   12   16   20   24   28   32   36
  naive:     100  100  100  680 5000 5000 5000 5000 5000 5000
  budgeted:  100  100  100  220  120  100  100  100  100  100
after the spike clears:
  naive offered load    = 5000  (STILL OVER CAPACITY — metastable collapse)
  budgeted offered load = 100  (recovered to baseline)

A transient spike (6 ticks) becomes a permanent outage under naive retries, because the retries become the new load and feed back on themselves. A retry budget caps the feedback and the system self-heals the instant the spike ends. Internalize that picture; it's the single most important resilience insight, and it's a guaranteed interview story.

1. Power of Two Choices (balancer.go)

Round-robin is load-blind: it keeps feeding a slow instance. Global least-loaded needs expensive, stale coordination. P2C is the sweet spot — pick two endpoints at random, send to the lower-score one — and the "power of two random choices" result says this gets you near-optimal max-load with O(1) local info.

score = (inflight+1) × (ewmaNs+1) blends live load and an EWMA of latency. TestP2CRoutesAroundSlowEndpoint makes one of five endpoints 100× slower and shows it's picked < 1% of the time: a high score loses every pairing, so traffic flows around a "gray failure" (up but slow) automatically. Round-robin would send it a flat 20%.

Outlier ejection — and the panic threshold

Observe(endpoint, latency, ok) updates the EWMA and, on consecutive failures, ejects the endpoint for a growing cool-down. TestOutlierEjectionRespectedByPick confirms Pick never returns an ejected endpoint; TestEjectionExpires confirms it comes back after the cool-down. The maintainer detail is TestPanicModeWhenTooManyEjected: if more than half the fleet would be ejected (a fleet-wide brown-out, not one bad instance), the balancer stops ejecting and uses everyone. Without that panic threshold, a correlated failure ejects the whole fleet and you black-hole all traffic — turning a brown-out into a total outage. Envoy has this exact safeguard for the exact same reason.

2. Adaptive concurrency (limiter.go)

A fixed concurrency limit is always wrong: set for the healthy case, it becomes a queue-builder the moment a dependency slows (in-flight piles up to the limit, latency explodes). Adaptive infers the right limit from latency via Little's Law (L = λ × W): track the minimum (no-load) latency; when current latency climbs above it a queue is forming, so shrink the limit; when they're close, grow it. Excess requests are shed at admission — bounded latency under overload.

The tests pin the control loop:

TestAdaptiveGrowsWhenHealthy: at current == minRTT the gradient is 1, so the limit grows (toward max) to find the throughput knee.
TestAdaptiveShrinksUnderLatency: a 100× latency jump collapses the limit toward its fixed point (L = 0.5L + √L ⇒ √L = 2 ⇒ L = 4), proving it backs off hard when a queue forms.
TestAdaptiveShedsAtLimit: once in-flight hits the limit, Acquire returns false — the request is shed, not queued.

Why shed instead of queue? A request the client already timed out on is doomed work; queueing it wastes capacity that live requests need. This is congestion control for your service, and — like TCP's — it needs no magic-number tuning. Cite Netflix's concurrency-limits.

3. Retry budgets & circuit breakers (budget.go)

Budget is a token bucket: each served request accrues ratio tokens, each retry spends one, an empty bucket means no retry. TestRetryBudgetCapsAmplification shows 20 requests at ratio 0.5 permit exactly 10 retries, then no more. Why a budget and not a per-request retry count? Under correlated failure (every request failing at once), a per-request "retry 3×" multiplies the entire load by 4 — precisely when the dependency can least afford it. A budget bounds retries to a fraction of total traffic, so amplification is capped no matter how many requests fail. That's the difference between the naive and budgeted lines in the simulator.
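
The token-bucket rule above is small enough to sketch in full. This version follows the description (each request accrues Ratio tokens, each retry spends one) and adds a Max cap on banked tokens, which is an assumption of this example rather than something the lab's budget.go is known to have. It is not safe for concurrent use without a lock.

```go
package main

// Budget is a token-bucket retry budget.
type Budget struct {
	Ratio  float64 // tokens earned per request, e.g. 0.1 allows ~10% retries
	Max    float64 // cap on banked tokens, so idle time can't fund a storm
	tokens float64
}

// OnRequest is called once per original request and accrues Ratio tokens.
func (b *Budget) OnRequest() {
	b.tokens += b.Ratio
	if b.tokens > b.Max {
		b.tokens = b.Max
	}
}

// TryRetry spends one token if available; false means "do not retry".
func (b *Budget) TryRetry() bool {
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}
```

With Ratio 0.5, twenty requests fund exactly ten retries, however many of those twenty failed.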

CircuitBreaker is the CLOSED → OPEN → HALF-OPEN state machine. TestCircuitBreaker walks the full cycle: it opens after sustained failures (fail fast — stop calling a sick dependency and piling up doomed requests), rejects during the cool-down, lets a single probe through when half-open, reopens if the probe fails, and closes if it succeeds. Tune Threshold so it trips on sustained failure, not noise — too-sensitive a breaker flaps and cuts healthy traffic.

4. How they compose (the call path)

A resilient origin call layers all four:

budget.OnRequest()
if !breaker.Allow() { return fallback }            // fail fast on sick dep
rel, ok := limiter.Acquire(); if !ok { return 503 } // shed at overload
ep := p2c.Pick(); p2c.Acquire(ep)                   // load-aware endpoint
resp, err := callWithTimeout(ep)                    // bounded latency (gw-03)
p2c.Release(ep); p2c.Observe(ep, latency, err==nil) // feed LB + ejection
breaker.Record(err == nil)
rel(latency, err != nil)                            // feed the limiter
if retryable(err) && idempotent && budget.TryRetry() { ... }

Each guard refuses to amplify in a different way. Together they keep a brown-out contained instead of cascading.

5. Hands-on

cd src/go
bash ../scripts/verify.sh           # tests + the metastability demo

# Play with the budget ratio and watch the recovery threshold move:
# (edit resilsim's ratio, or parameterize it as an exercise)

6. Exercises

Wire it into gw-03: build a resilient http.RoundTripper that wraps ProxyEndpoint.Transport with budget + breaker + limiter + P2C over a gw-04 subset. Drive wrk through a flaky origin and watch the retries/requests ratio stay capped.
Add hedging: after p95, send a second request to another endpoint and take the first response; budget the hedges. Show it cuts the tail when the set is healthy and hurts under broad slowness — so gate it on the set being mostly healthy.
Make the breaker rate-based: open on error-rate-over-a-window with a minimum request volume (not consecutive failures), so a low-traffic endpoint doesn't trip on two unlucky errors.
Criticality-aware shedding: when the limiter sheds, drop retries before originals and background before interactive. Implement a priority and prove the core experience survives overload.
Reproduce metastability with real services: stand up a flaky origin and a load generator; show the naive-retry storm and the budget fix on actual retries/requests metrics (gw-11).

gw-06 — References

Foundational reading

Google SRE Book — "Handling Overload," "Addressing Cascading Failures," "Load Balancing in the Datacenter." The canonical treatment of retries, budgets, shedding, and graceful degradation. https://sre.google/sre-book/handling-overload/
Marc Brooker — Timeouts, retries, and backoff with jitter (AWS Builders' Library) and the metastable-failure posts. The retry-storm and jitter canon. https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/
Metastable Failures in Distributed Systems (HotOS 2021, Bronson et al.) — the formal framing of self-sustaining overload.

Load balancing

Mitzenmacher, The Power of Two Choices in Randomized Load Balancing — the P2C result (exponential reduction in max load).
Twitter Finagle — P2C + EWMA ("p2c / leastLoaded") and the Aperture subsetting balancer. https://twitter.github.io/finagle/guide/Clients.html
Envoy LB docs — P2C (LEAST_REQUEST), outlier detection, zone-aware routing, circuit breakers. https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/load_balancers

Adaptive concurrency

Netflix concurrency-limits (Java) — the Gradient/Vegas-style adaptive limiter; read the README and the Limit implementations. https://github.com/Netflix/concurrency-limits
Netflix TechBlog — Performance Under Load (adaptive concurrency limits at the edge). https://netflixtechblog.medium.com/performance-under-load-3e6fa9a60581
Envoy adaptive concurrency filter (the same idea in the data plane).
TCP Vegas / congestion-control papers — the control-loop intuition.

Circuit breaking & bulkheads

Michael Nygard, Release It! — circuit breakers, bulkheads, the stability patterns vocabulary.
Netflix Hystrix (archived) + resilience4j — circuit breaker / bulkhead / rate limiter implementations to read.

Cross-lab dependencies

Upstream: gw-03 (endpoint phase + timeout), gw-04 (subset to balance over).
Downstream: gw-08 (Envoy implements all of this via xDS), gw-11 (the metrics that make it observable), gw-05 (reconnect backoff).

gw-06 — Analysis

The design-review treatment of the LB + adaptive limiter you build, with the trade-offs to defend.

Required behaviors

Load-aware balancing. The balancer must route around a momentarily-slow endpoint, not walk into it (round-robin fails this).
Bounded amplification. Retries can never exceed a configured fraction of base traffic; verified by a retries/requests metric.
Fail fast on sick dependencies. A circuit breaker opens on sustained failure (rate over a window with a min-volume floor), not on noise, and probes recovery in half-open.
Bounded in-flight under overload. Adaptive concurrency keeps latency bounded by shedding at admission past the knee, with no magic-number tuning.
Graceful degradation. When shedding, the least-important traffic goes first; the core experience is protected.
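
The "rate over a window with a min-volume floor" trip condition can be sketched as below. This is a minimal illustration, not the lab's breaker (the step 03 code counts consecutive failures); the `window` type, `shouldOpen`, and the 20-request / 50% values are hypothetical.

```go
package main

import "fmt"

// window holds outcome counts for one evaluation window (hypothetical type).
type window struct {
	total    int
	failures int
}

// shouldOpen reports whether the breaker should trip: the window must hold
// at least minVolume requests AND the failure rate must reach threshold.
func shouldOpen(w window, minVolume int, threshold float64) bool {
	if w.total < minVolume {
		return false // too few samples: treat failures as noise
	}
	return float64(w.failures)/float64(w.total) >= threshold
}

func main() {
	fmt.Println(shouldOpen(window{total: 3, failures: 3}, 20, 0.5))    // below the volume floor
	fmt.Println(shouldOpen(window{total: 100, failures: 60}, 20, 0.5)) // sustained 60% failure
	fmt.Println(shouldOpen(window{total: 100, failures: 10}, 20, 0.5)) // 10% is under the threshold
}
```

The volume floor is what keeps three failures out of three requests on a quiet route from opening the circuit.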

Design decisions

P2C with EWMA latency. The balancer picks two random endpoints and takes the one with the lower EWMA in-flight/latency score. EWMA (not instantaneous) damps noise; two random choices avoid global state. The step measures max-load vs round-robin under a skewed endpoint.
Outlier ejection from live traffic. Passive: count consecutive failures / compare latency to the set median; eject for a cool-down that grows with repeated ejections; probe back. Reacts faster than active health checks and catches "up but broken."
Gradient-style adaptive limiter. Track RTT_noload (a long-window minimum) and current RTT; limit ← limit × (RTT_noload / RTT_current) bounded by AIMD-style growth/backoff, à la Netflix concurrency-limits. Over-limit requests get a fast 503. No fixed thread pool to mis-size.
Retry budget as a token bucket. Retries draw from a bucket refilled at budget × request_rate; empty bucket = no retry. This caps amplification globally, unlike per-request retry counts which multiply under correlated failure.

Tradeoffs worth flagging

P2C needs in-flight counts per endpoint. Cheap locally, but with per-event-loop state (gw-04) each loop has a partial view. Decide whether load counters are per-loop or shared; per-loop is lock-free but slightly less globally accurate — usually fine.
Outlier ejection can cascade. Eject too aggressively during a fleet-wide brown-out and you eject everyone, leaving no capacity. Cap the fraction of a set you'll eject (e.g. ≤50%), as Envoy's outlier detection does with max_ejection_percent, and route across all hosts once too few remain healthy (Envoy's panic threshold).
Adaptive concurrency vs bursty traffic. A limiter tuned for steady traffic can shed legitimate bursts; too sloppy and it lets a queue build. The min-RTT window length is the key knob; document it and how you'd tune it from data.
Retry budgets need a shared denominator. "10% of traffic" is fleet-wide; a purely per-instance budget can still amplify if failures correlate. Decide the scope and accept the approximation if you keep it per-instance.
Hedging vs load. Hedging cuts the tail but adds duplicate load (budgeted, e.g. only hedge 5% of requests at p95). Under broad slowness, hedging makes things worse — gate it on the set being mostly healthy.

What production adds beyond this lab

Per-route/per-cluster policy from the control plane (gw-08): different timeouts, retry budgets, circuit thresholds, and limits per backend.
Request criticality/priority propagated end-to-end so shedding is semantically correct, not just "drop the newest."
Coordinated bulkheads: isolate resource pools per dependency so one sick dependency can't starve others.
Distributed rate limiting (a shared token store) when per-instance limits aren't enough (gw-03's "stateless gateway holds state" point).
Full observability: retries/requests, circuit state transitions, ejection events, limiter value vs in-flight, shed counts by criticality (gw-11).

gw-06 — Execution

Prerequisites

Go ≥ 1.25 (stdlib only, offline).

One-shot

cd gw-06-resilience-load-balancing && bash scripts/verify.sh   # tests + retry-storm demo

Per-language workflow (Go)

cd gw-06-resilience-load-balancing/src/go
go test -race -count=1 ./...        # P2C, ejection, adaptive limiter, budget, breaker
go run ./cmd/resilsim               # metastability vs retry budget

Package map

File	What
`resil/balancer.go`	P2C + EWMA scoring + outlier ejection + panic threshold
`resil/limiter.go`	adaptive concurrency limiter (Little's Law / gradient), shed-at-admission
`resil/budget.go`	retry budget (token bucket) + circuit breaker (CLOSED/OPEN/HALF-OPEN)
`cmd/resilsim`	retry-storm / metastable-failure demonstration

resilsim output (what to look for)

naive offered load stays pinned over capacity after the spike clears (metastable collapse).
budgeted returns to baseline immediately (recovered).

See GUIDE.md for the full toolkit and the composed call-path.

gw-06 — Verification

One command

cd gw-06-resilience-load-balancing && bash scripts/verify.sh

What the tests prove

Test	Invariant
`TestP2CRoutesAroundSlowEndpoint`	P2C picks a 100×-slow endpoint <1% of the time (round-robin would send 20%)
`TestOutlierEjectionRespectedByPick`	an ejected endpoint is never picked
`TestEjectionExpires`	ejection clears after the (growing) cool-down
`TestPanicModeWhenTooManyEjected`	if >50% would be ejected, the balancer uses everyone (never black-holes)
`TestAdaptiveGrowsWhenHealthy`	limit grows when current latency ≈ no-load minimum
`TestAdaptiveShrinksUnderLatency`	a 100× latency jump collapses the limit (toward its ~4 fixed point)
`TestAdaptiveShedsAtLimit`	requests over the limit are shed at admission, not queued
`TestRetryBudgetCapsAmplification`	retries are capped at the budget fraction of traffic
`TestCircuitBreaker`	full CLOSED→OPEN→HALF-OPEN cycle: fail-fast, probe, reopen-on-fail, close-on-success

All under -race (balancers/limiters/breakers use atomics + injectable clocks for determinism).

Demo assertion (resilsim, in verify.sh)

naive retries → offered load stuck over capacity after the spike (metastable),
retry budget → recovers to baseline immediately.

What "green" does NOT guarantee

Not wired into a live proxy. Composing budget+breaker+limiter+P2C into a resilient http.RoundTripper is an exercise (GUIDE §6.1).
No hedging / criticality shedding in the core package (exercises §6.2/§6.4).
Per-instance budgets/limits. Fleet-wide correctness for distributed rate limiting needs a shared store (CONCEPTS §3 / gw-03 note).

gw-06 step 01 — Power-of-two-choices load balancing + outlier ejection

Goal

Implement P2C load balancing with EWMA latency scoring and passive outlier ejection, then show it routes around a slow endpoint where round-robin does not.

Code — `src/go/balancer.go`

package lb

import (
	"math/rand"
	"sync"
	"sync/atomic"
	"time"
)

type Endpoint struct {
	Addr         string
	inflight     atomic.Int64
	ewmaNs       atomic.Int64 // EWMA of latency in ns
	ejectedUntil atomic.Int64 // unix-nano; 0 = healthy
	fails        atomic.Int64
}

// score: lower is better. Combine in-flight and latency EWMA.
func (e *Endpoint) score() float64 {
	return float64(e.inflight.Load()+1) * float64(e.ewmaNs.Load()+1)
}

func (e *Endpoint) ejected(now int64) bool { return e.ejectedUntil.Load() > now }

type P2C struct {
	mu  sync.RWMutex
	eps []*Endpoint
}

func NewP2C(addrs []string) *P2C {
	b := &P2C{}
	for _, a := range addrs {
		b.eps = append(b.eps, &Endpoint{Addr: a})
	}
	return b
}

// Pick returns an endpoint via power-of-two-choices among healthy ones.
// It draws from the package-level math/rand functions, which are safe for
// concurrent use; a shared *rand.Rand is not, and Pick only holds a read lock.
func (b *P2C) Pick() *Endpoint {
	b.mu.RLock()
	defer b.mu.RUnlock()
	if len(b.eps) == 0 {
		return nil // nothing to pick from
	}
	now := time.Now().UnixNano()
	healthy := make([]*Endpoint, 0, len(b.eps))
	for _, e := range b.eps {
		if !e.ejected(now) {
			healthy = append(healthy, e)
		}
	}
	if len(healthy) == 0 {
		healthy = b.eps // panic mode: everyone ejected -> use all (don't black-hole)
	}
	if len(healthy) == 1 {
		return healthy[0]
	}
	i := rand.Intn(len(healthy))
	j := rand.Intn(len(healthy) - 1)
	if j >= i {
		j++ // distinct second choice
	}
	a, c := healthy[i], healthy[j]
	if a.score() <= c.score() {
		return a
	}
	return c
}

// Observe records the result of a call for EWMA + ejection.
func (b *P2C) Observe(e *Endpoint, latency time.Duration, ok bool) {
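	// EWMA: new = alpha*sample + (1-alpha)*old. The CAS loop keeps
	// concurrent Observe calls from overwriting each other's update.
	const alpha = 0.2
	sample := latency.Nanoseconds()
	for {
		old := e.ewmaNs.Load()
		next := int64(alpha*float64(sample) + (1-alpha)*float64(old))
		if e.ewmaNs.CompareAndSwap(old, next) {
			break
		}
	}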

	if ok {
		e.fails.Store(0)
		return
	}
	// Consecutive-failure ejection with growing cool-down.
	n := e.fails.Add(1)
	if n >= 5 {
		cool := time.Duration(n) * 2 * time.Second
		e.ejectedUntil.Store(time.Now().Add(cool).UnixNano())
	}
}

The experiment — P2C vs round-robin under a slow endpoint

setup: 5 endpoints; endpoint #3 injected with 10x latency (a "gray
failure" — up but slow).

round-robin: 1/5 of requests hit #3 and eat its 10x latency; the p99 is
dominated by #3. Throughput drops because in-flight piles up on #3.

P2C + EWMA: as #3's EWMA latency climbs, it almost always loses the
"two choices" comparison, so it gets a shrinking share. The p99 stays
near the healthy endpoints'. If #3 starts erroring, outlier ejection
removes it entirely.

Tasks

Implement P2C, Observe, EWMA, and ejection.
Stand up 5 mock endpoints; inject 10× latency on one. Drive load and compare the request distribution and p99 for round-robin vs P2C.
Flip the slow endpoint to erroring; confirm outlier ejection removes it within ~5 failures and probes it back after the cool-down.
Add the panic threshold: when >50% would be ejected, stop ejecting (a fleet-wide brown-out must not black-hole all traffic).
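
One way to read the panic-threshold task, sketched under assumptions: filter out ejected endpoints as usual, but if more than a configured fraction of the set is ejected, ignore ejection and hand back everyone. The `endpoint` type, `candidates`, and the 0.5 fraction are hypothetical stand-ins, not the lab's API.

```go
package main

import "fmt"

// endpoint is a stand-in for the lab's Endpoint; only ejection state matters here.
type endpoint struct {
	addr    string
	ejected bool
}

// candidates returns the endpoints a picker may choose from. If more than
// maxEjectedFrac of the set is ejected, it ignores ejection and returns the
// whole set (panic mode), so a fleet-wide brown-out cannot black-hole traffic.
func candidates(eps []endpoint, maxEjectedFrac float64) []endpoint {
	healthy := make([]endpoint, 0, len(eps))
	for _, e := range eps {
		if !e.ejected {
			healthy = append(healthy, e)
		}
	}
	ejected := len(eps) - len(healthy)
	if len(eps) > 0 && float64(ejected)/float64(len(eps)) > maxEjectedFrac {
		return eps // panic mode: use everyone
	}
	return healthy
}

func main() {
	eps := []endpoint{{"a", false}, {"b", true}, {"c", false}, {"d", false}}
	fmt.Println(len(candidates(eps, 0.5))) // 1 of 4 ejected: 3 healthy candidates

	eps[0].ejected, eps[2].ejected = true, true
	fmt.Println(len(candidates(eps, 0.5))) // 3 of 4 ejected: panic mode, all 4
}
```

An alternative reading is to refuse new ejections once the cap is reached; either way the invariant is that the candidate set never becomes empty.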

Acceptance

P2C sends a shrinking share to the slow endpoint; p99 stays near the healthy set's. Round-robin's p99 is dragged by the slow one.
Outlier ejection removes a failing endpoint and restores it after cool-down; the panic threshold prevents ejecting everyone.

Discussion prompts

Why is EWMA (not the last latency) the right signal? What does a too-high vs too-low alpha do?
Why does P2C beat round-robin so much for almost no extra cost? (The "power of two choices" exponential max-load result.)
With per-event-loop state (gw-04), each loop sees only its own in-flight counts. Is per-loop P2C good enough, or do you need shared counters? Argue it.

gw-06 step 02 — Adaptive concurrency limiting

Goal

Replace a fixed concurrency limit with one that adapts from observed latency, so the gateway keeps latency bounded under overload by shedding at admission — Netflix's concurrency-limits idea in miniature.

Theory in one paragraph

By Little's Law, in-flight L = λ × W. While the system is below its knee, raising the in-flight limit raises throughput at ~flat latency. Past the knee, latency rises but throughput doesn't — you're just queueing. So: track the minimum latency ever seen (RTT_noload, the uncontended baseline) and the current latency. If current ≈ minimum, there's headroom — grow the limit. If current >> minimum, a queue is forming — shrink it. Reject requests over the limit immediately (fast 503) instead of letting them queue.

Code — `src/go/limiter.go`

package limit

import (
	"math"
	"sync"
	"sync/atomic"
	"time"
)

// Gradient2-style adaptive limiter (simplified). limit grows when
// current latency is near the no-load minimum, shrinks when it climbs.
type Adaptive struct {
	mu       sync.Mutex
	limit    float64
	minRTT   time.Duration // long-window minimum (no-load baseline)
	inflight atomic.Int64

	minLimit, maxLimit float64
}

func New() *Adaptive {
	return &Adaptive{limit: 20, minLimit: 1, maxLimit: 2000, minRTT: time.Hour}
}

// Acquire admits a request if under the current limit. Returns a release
// func and true, or false if the request must be shed.
func (a *Adaptive) Acquire() (release func(rtt time.Duration, dropped bool), ok bool) {
	if float64(a.inflight.Add(1)) > a.curLimit() {
		a.inflight.Add(-1)
		return nil, false // SHED at admission — bounded latency under load
	}
	start := time.Now()
	return func(rtt time.Duration, dropped bool) {
		a.inflight.Add(-1)
		if rtt == 0 {
			rtt = time.Since(start)
		}
		a.update(rtt, dropped)
	}, true
}

func (a *Adaptive) curLimit() float64 {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.limit
}

// update adjusts the limit from the latest sample (the control loop).
func (a *Adaptive) update(rtt time.Duration, dropped bool) {
	a.mu.Lock()
	defer a.mu.Unlock()

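	if rtt > 0 && rtt < a.minRTT {
		a.minRTT = rtt // track the uncontended baseline; a zero sample would pin it at 0
	}
	if dropped {
		a.limit = math.Max(a.minLimit, a.limit*0.9) // multiplicative decrease on error
		return
	}
	if rtt <= 0 {
		return // no usable latency sample: leave the limit unchanged
	}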
	// gradient = noload / current, clamped to [0.5, 1.0]
	gradient := math.Max(0.5, math.Min(1.0,
		float64(a.minRTT)/float64(rtt)))
	// queue size headroom allows additive growth when gradient ~ 1.
	newLimit := a.limit*gradient + math.Sqrt(a.limit) // headroom term
	a.limit = math.Max(a.minLimit, math.Min(a.maxLimit, newLimit))
}

The experiment — fixed vs adaptive under a latency cliff

phase 1 (healthy): origin RTT ~5ms. Both fixed(=200) and adaptive serve
fine; adaptive settles near the throughput knee.

phase 2 (dependency slows to 100ms):
  FIXED limit 200: in-flight rushes to 200 and a queue builds; by
    Little's Law W = L/λ, so with 200 in flight and throughput capped by
    the slow dependency, latency climbs into seconds; clients time out;
    retries pile on (step 03).
  ADAPTIVE: current RTT (100ms) >> minRTT (5ms) -> raw ratio ~0.05,
    clamped to gradient 0.5 -> limit collapses toward its ~4 fixed
    point; excess is shed with fast 503s; served requests keep ~bounded
    latency. The system stays in control.
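
The update rule in limiter.go clamps the gradient to [0.5, 1.0], so under a sustained latency jump the limit does not fall all the way to minLimit. It settles where limit = 0.5·limit + √limit, which solves to 4. The sketch below iterates that rule to show the convergence; `step` and `settle` are hypothetical helpers, and the minLimit/maxLimit clamps are left out for brevity.

```go
package main

import (
	"fmt"
	"math"
)

// step applies the lab's update rule for one non-dropped sample.
// minRTT and rtt are in the same unit (milliseconds here).
func step(limit, minRTT, rtt float64) float64 {
	gradient := math.Max(0.5, math.Min(1.0, minRTT/rtt))
	return limit*gradient + math.Sqrt(limit)
}

// settle iterates step n times from start and returns the resulting limit.
func settle(start, minRTT, rtt float64, n int) float64 {
	limit := start
	for i := 0; i < n; i++ {
		limit = step(limit, minRTT, rtt)
	}
	return limit
}

func main() {
	// 5ms baseline, 100ms current: raw ratio 0.05, clamped to 0.5.
	fmt.Printf("%.2f\n", settle(20, 5, 100, 200)) // prints 4.00
}
```

The same fixed point is what `TestAdaptiveShrinksUnderLatency` refers to as the "~4 fixed point".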

Tasks

Implement Adaptive; wrap the endpoint call (Acquire → call origin → release(rtt, dropped)).
Run the two-phase experiment; plot served-latency p99 and shed-rate for fixed vs adaptive across the latency cliff.
Show that adaptive needs no magic number: it finds a reasonable limit from latency alone, and recovers (grows back) when the dependency heals.

Acceptance

Under the latency cliff, the fixed limiter's served latency explodes while the adaptive limiter holds latency ~bounded by shedding.
The adaptive limit shrinks during overload and grows back on recovery, with no hand-tuned threshold.

Discussion prompts

Why shed at admission rather than queue? (Bounded latency; a queued request that the client already timed out on is pure waste — "doomed work.")
How is this analogous to TCP congestion control (AIMD, RTT-based)?
Where does this sit relative to the connection pool (gw-04)? (Pool acquisition is itself a concurrency limit; reconcile the two so they don't fight.)

gw-06 step 03 — Reproduce a retry storm, then stop it with a budget

Goal

See retry amplification turn a small blip into a metastable collapse, then add a retry budget + backoff/jitter + circuit breaker and watch the system stay up. This is the most important resilience lesson and a guaranteed interview story.

Code — retry budget (token bucket) + circuit breaker

package resil

import (
	"sync"
	"time"
)

// Budget caps retries to a fraction of total request volume. Each
// request adds `ratio` tokens; each retry costs 1. Empty bucket => no
// retry. This bounds amplification GLOBALLY, unlike per-request counts.
type Budget struct {
	mu     sync.Mutex
	tokens float64
	ratio  float64 // e.g. 0.1 = retries may be at most ~10% of traffic
	max    float64
}

func NewBudget(ratio float64) *Budget {
	return &Budget{ratio: ratio, max: 100, tokens: 100}
}

func (b *Budget) OnRequest() {
	b.mu.Lock()
	b.tokens = min(b.max, b.tokens+b.ratio)
	b.mu.Unlock()
}

func (b *Budget) TryRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false // budget exhausted: DO NOT retry (amplification guard)
}

// CircuitBreaker: CLOSED -> OPEN on sustained failure -> HALF-OPEN probe.
type CircuitBreaker struct {
	mu        sync.Mutex
	state     int // 0=closed 1=open 2=half-open
	failures  int
	threshold int
	openedAt  time.Time
	cooldown  time.Duration
}

func (c *CircuitBreaker) Allow() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
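	switch c.state {
	case 1: // open
		if time.Since(c.openedAt) > c.cooldown {
			c.state = 2 // half-open: this caller is the single probe
			return true
		}
		return false // fail fast: don't call a known-sick dependency
	case 2: // half-open: a probe is in flight; hold everyone else until Record resolves it
		return false
	default:
		return true
	}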
}

func (c *CircuitBreaker) Record(ok bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if ok {
		c.failures = 0
		c.state = 0 // closed
		return
	}
	c.failures++
	if c.state == 2 || c.failures >= c.threshold {
		c.state = 1
		c.openedAt = time.Now()
	}
}

Code — the call path with all guards

func (g *Gateway) call(ep *Endpoint) (resp, error) {
	g.budget.OnRequest()
	if !g.breaker.Allow() {
		return fallback(), nil // circuit open: fast fallback, no doomed call
	}
	for attempt := 0; ; attempt++ {
		r, err := doWithTimeout(ep, g.timeout)
		g.breaker.Record(err == nil)
		if err == nil || !retryable(err) {
			return r, err
		}
		if attempt >= 1 || !g.idempotent || !g.budget.TryRetry() {
			return r, err // no retry: already retried once, not idempotent, or budget empty
		}
		time.Sleep(backoffWithJitter(attempt)) // exp backoff + full jitter
	}
}
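
The call path uses backoffWithJitter without defining it. A minimal sketch of exponential backoff with full jitter follows, assuming a 50ms base and a 2s cap; both values and the `backoffCeiling` helper are illustrative, not taken from the lab.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	baseDelay = 50 * time.Millisecond // assumed base delay
	maxDelay  = 2 * time.Second       // assumed cap
)

// backoffCeiling returns min(maxDelay, baseDelay * 2^attempt).
func backoffCeiling(attempt int) time.Duration {
	d := baseDelay
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

// backoffWithJitter returns a uniformly random delay in [0, ceiling).
// This is "full jitter": it spreads retries across the whole interval
// instead of letting clients retry in synchronized waves.
func backoffWithJitter(attempt int) time.Duration {
	return time.Duration(rand.Int63n(int64(backoffCeiling(attempt))))
}

func main() {
	for attempt := 0; attempt < 4; attempt++ {
		fmt.Printf("attempt %d: ceiling %v, sleep %v\n",
			attempt, backoffCeiling(attempt), backoffWithJitter(attempt))
	}
}
```

Without jitter, every client that failed at the same instant retries at the same instant, which recreates the spike the backoff was meant to spread out.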

The experiment — blip → storm → recovery

A) NAIVE (3 retries, no budget, no breaker, no jitter):
   inject a 30s, 20%-error blip on the origin.
   -> each failing request retries up to 3x -> offered load up to
      1 + 0.2×3 = 1.6x during the blip (~1.25x if retries fail
      independently at 20%) -> origin pushed further past its knee ->
      error rate climbs -> MORE retries -> load stays high AFTER the
      blip ends -> metastable: the system does NOT recover on its own.

B) GUARDED (budget=10%, breaker, backoff+jitter, idempotent-only):
   same blip.
   -> retries capped at ~10% extra -> breaker opens when errors sustain,
      shedding doomed calls to a fast fallback -> offered load stays near
      baseline -> origin recovers as soon as the blip clears -> system
      self-heals.

Tasks

Implement the budget, breaker, and guarded call path.
Reproduce scenario A: plot offered load and error rate; show it stays elevated after the trigger clears (metastable).
Run scenario B: show offered load stays bounded and the system recovers immediately when the blip ends.
Add the metric retries / requests and confirm it's capped at the budget ratio in B and spikes uncontrolled in A (this is the gw-11 signal you'd alert on).

Acceptance

A reproduced metastable collapse with naive retries, and self-healing with budget + breaker + jitter.
A retries/requests metric you can point to as the early-warning signal, capped by the budget.

Discussion prompts

Why does a budget succeed where a smaller per-request retry count fails under correlated failure? (Per-request counts still multiply when every request fails at once; a budget bounds the aggregate.)
Why retry only idempotent requests, and how do you make a non-idempotent operation safely retryable? (Idempotency keys.)
Hedging vs retries: when does adding a second request help (tail latency) and when does it hurt (broad slowness)? Where's the budget?

gw-07 — Security at the Edge: mTLS, Identity, and Zero-Trust

The JD names "uplift the security… posture for traffic traversing the cloud" as a primary responsibility, and links the talk "The Show Must Go On: Securing Netflix Studios at Scale." Studio/partner traffic (pre-release content, production assets) is among the most sensitive at Netflix, and the gateway is where its security posture is enforced. The modern answer is zero-trust: never trust the network, authenticate every connection cryptographically, and authorize every request.

This lab covers the security mechanisms a Cloud Gateway engineer owns: TLS 1.3 termination, mutual TLS (mTLS), workload identity (SPIFFE/SVID; Netflix's Metatron), certificate issuance and rotation, SNI-based routing and passthrough, and authn/authz at the gateway (token validation, the fail-open vs fail-closed decision). The throughline: identity is the new perimeter.

You will stand up mutual TLS between a client and the gateway, validate the client's identity from its certificate, and enforce an authorization policy — the concrete core of zero-trust at the edge.

1. What is it?

Zero-trust discards the old "hard shell, soft center" model (trust anything inside the network) for "trust nothing; verify everything." Every connection proves who it is with a cryptographic credential, and every request is authorized against a policy. At the edge this means:

TLS termination — decrypt inbound TLS so the gateway can inspect and route (gw-02/gw-03). The gateway holds the server certificate and private key for the public hostnames.
mutual TLS (mTLS) — both sides present certificates. The client proves its identity to the gateway (and gateway→origin mTLS proves the gateway to the origin). This is how service-to-service traffic authenticates without passwords or network-location trust.
Workload identity — a short-lived certificate whose subject is the workload's identity (a SPIFFE ID like spiffe://netflix/studio/asset-service), issued automatically by an identity platform (SPIRE; Netflix's Metatron, which injects credentials at the continuous-delivery phase).
Authorization — given an authenticated identity, decide if this identity may perform this request (which route, method, resource).

client/workload ──mTLS (both present certs)──▶ [GATEWAY]
   identity = SPIFFE ID from client cert        verify cert chain + identity
                                                authorize(identity, request)
                                                ──mTLS──▶ origin (gateway proves itself)

2. Why does it matter?

It's a primary JD responsibility and a flagship talk. "Securing Netflix Studios at Scale" is about exactly this: protecting high-value content traffic with strong identity and policy at scale. Being fluent in mTLS, identity, and rotation is directly hireable.
The edge is the enforcement point. Authenticating and authorizing at the gateway means every backend service inherits a strong security posture without each one re-implementing TLS, token validation, and policy. Centralization (gw-03) applied to security.
Certificate operations are where security meets reliability. A cert that expires unrotated takes down a service — a self-inflicted outage that's depressingly common. Automated issuance + rotation (short-lived certs, hot reload without dropping connections) is an operational skill the role needs, and it ties back to draining (gw-01) and connection management (gw-04).
The fail-open/fail-closed decision is a senior judgment call. If the authz service is down, do you let traffic through (fail-open: available but possibly unauthorized) or block it (fail-closed: secure but unavailable)? The answer depends on the traffic's sensitivity, and articulating that trade-off is exactly the judgment a "5" is hired for.

3. How does it work?

The TLS 1.3 handshake (what you must be able to narrate)

client                                   server (gateway)
  │ ClientHello (SNI, ALPN, key_share, cipher suites)  │
  │ ───────────────────────────────────────────────▶  │
  │ ServerHello (key_share), EncryptedExtensions (ALPN) │  selects cert by SNI
  │ [mTLS: CertificateRequest] Certificate (server)    │  (one listener, many hosts)
  │ CertificateVerify (server) + Finished              │
  │ ◀───────────────────────────────────────────────  │
  │ Certificate (client) + CertificateVerify + Finished│  client proves identity
  │ ───────────────────────────────────────────────▶  │  verify chain + SPIFFE ID
  │ ════════════ application data (1-RTT) ════════════ │

SNI (Server Name Indication) tells the server which hostname the client wants before the cert is chosen — so one listener serves many domains, each with its own cert. SNI also enables TLS passthrough: an L4 proxy can route by SNI without decrypting (keeps splice, gw-01).
ALPN negotiates the protocol (h2, http/1.1, h3) inside the handshake.
mTLS adds the CertificateRequest + client Certificate + CertificateVerify — the client signs to prove it holds the private key for the cert it presented.

Workload identity: SPIFFE / SVID / Metatron

A SPIFFE ID is a URI naming a workload: spiffe://trust-domain/path (e.g. spiffe://netflix/studio/asset-service). It lives in the certificate's SAN (Subject Alternative Name, URI type).
An SVID (SPIFFE Verifiable Identity Document) is the X.509 cert carrying that ID, short-lived (hours), auto-rotated.
SPIRE is the reference issuer; it attests a workload (proves it's really asset-service via node + workload attestation) before issuing its SVID.
Metatron is Netflix's equivalent: it solves the credential bootstrap problem by injecting identity into each microservice at the continuous-delivery phase, so a service comes up already holding a rotatable identity — no secret-zero to leak.
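
Reading the identity out of a certificate can be sketched with crypto/x509 alone. The helper names `spiffeID` and `certWithURIs` are hypothetical (the lab's own extractor is SpiffeIDFromState), and the sketch rejects certificates that do not carry exactly one URI SAN, following the X.509-SVID convention of a single URI SAN.

```go
package main

import (
	"crypto/x509"
	"fmt"
	"net/url"
)

// spiffeID returns the SPIFFE ID carried in cert's URI SAN. It accepts only
// certificates with exactly one URI SAN whose scheme is "spiffe" and whose
// trust domain (the URI host) is non-empty.
func spiffeID(cert *x509.Certificate) (string, bool) {
	if len(cert.URIs) != 1 {
		return "", false
	}
	u := cert.URIs[0]
	if u == nil || u.Scheme != "spiffe" || u.Host == "" {
		return "", false
	}
	return u.String(), true
}

// certWithURIs builds a bare certificate struct carrying the given URI SANs.
// It exists only so the sketch runs without a CA; real code receives a
// parsed, chain-verified certificate from the TLS stack.
func certWithURIs(raw ...string) *x509.Certificate {
	c := &x509.Certificate{}
	for _, r := range raw {
		u, err := url.Parse(r)
		if err != nil {
			panic(err)
		}
		c.URIs = append(c.URIs, u)
	}
	return c
}

func main() {
	id, ok := spiffeID(certWithURIs("spiffe://netflix/studio/asset-service"))
	fmt.Println(id, ok)
}
```

Extraction is only meaningful after chain verification; an identity read from an unverified certificate proves nothing.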

Certificate rotation without dropping connections

Short-lived certs mean you rotate constantly. The gateway must:

Fetch the new cert/key before the old expires (a rotation agent / SDS — Secret Discovery Service in Envoy, gw-08).
Hot-swap it for new handshakes via an atomic pointer in the TLS config's GetCertificate callback — existing connections keep their session; new ones use the new cert.
Never block the event loop on a cert fetch (gw-03).

This is the same atomic-swap, no-dropped-request pattern as route hot-reload (gw-03) and pool membership (gw-04).
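
The hot-swap step can be sketched with an atomic pointer behind tls.Config.GetCertificate. The `certHolder` type and its `set`/`get` methods are hypothetical names for illustration; how the new certificate is fetched (rotation agent, SDS) is out of scope here.

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
	"sync/atomic"
)

// certHolder publishes the current serving certificate to new handshakes.
type certHolder struct {
	cur atomic.Pointer[tls.Certificate]
}

// set swaps in a new certificate. Established connections are unaffected;
// only handshakes that start after the swap see the new certificate.
func (h *certHolder) set(c *tls.Certificate) { h.cur.Store(c) }

// get has the signature tls.Config.GetCertificate expects. It never blocks:
// it only loads a pointer.
func (h *certHolder) get(*tls.ClientHelloInfo) (*tls.Certificate, error) {
	c := h.cur.Load()
	if c == nil {
		return nil, errors.New("no certificate loaded")
	}
	return c, nil
}

// emptyCert returns a placeholder certificate so the sketch runs without a CA.
func emptyCert() *tls.Certificate { return &tls.Certificate{} }

func main() {
	h := &certHolder{}
	oldCert, newCert := emptyCert(), emptyCert()
	h.set(oldCert)

	cfg := &tls.Config{MinVersion: tls.VersionTLS13, GetCertificate: h.get}
	got, _ := cfg.GetCertificate(nil)
	fmt.Println(got == oldCert)

	h.set(newCert) // rotation: one atomic store, no listener restart
	got, _ = cfg.GetCertificate(nil)
	fmt.Println(got == newCert)
}
```

Because the callback is consulted per handshake, the listener and its tls.Config never need to be rebuilt on rotation.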

Authorization at the gateway

Once the identity is known (from the client cert, or a validated JWT/token for end users), authorize the request:

authorize(identity, request):
    policy = lookup(route, method)
    if identity in policy.allowed_identities: ALLOW
    else: DENY (403)
    # decision may call an external authz service — async, timeout,
    # cached, with a fail-open/fail-closed policy per route sensitivity
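
A hedged Go rendering of the pseudocode above, with the fail-open/fail-closed choice made explicit per policy. All names (`policy`, `externalCheck`, `authorize`, `authzDown`, `checkTimeout`) are hypothetical, and the order (local allow-list first, then an optional external check) is one assumption about how the two compose.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type decision int

const (
	deny decision = iota
	allow
)

// policy is a hypothetical per-route policy.
type policy struct {
	allowed  map[string]bool // identities allowed without an external call
	failOpen bool            // behavior when the external check is unavailable
}

// externalCheck stands in for a call to an authz service. It is expected to
// honor ctx cancellation so the timeout below actually bounds the call.
type externalCheck func(ctx context.Context, identity string) (bool, error)

const checkTimeout = 50 * time.Millisecond // illustrative value

// authorize allows locally listed identities, otherwise asks the external
// check; on outage or timeout it falls back to the policy's fail mode.
func authorize(p policy, identity string, check externalCheck, timeout time.Duration) decision {
	if p.allowed[identity] {
		return allow
	}
	if check == nil {
		return deny
	}
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	ok, err := check(ctx, identity)
	if err != nil { // authz outage or timeout
		if p.failOpen {
			return allow
		}
		return deny
	}
	if ok {
		return allow
	}
	return deny
}

// authzDown simulates an unavailable authz service.
func authzDown(context.Context, string) (bool, error) {
	return false, errors.New("authz unavailable")
}

func main() {
	studio := policy{allowed: map[string]bool{"spiffe://netflix/studio/asset-service": true}}
	catalog := policy{failOpen: true}
	fmt.Println(authorize(studio, "spiffe://netflix/other", authzDown, checkTimeout) == deny)   // fail-closed
	fmt.Println(authorize(catalog, "spiffe://netflix/other", authzDown, checkTimeout) == allow) // fail-open
}
```

Keeping the fail mode on the policy rather than in the gateway makes the sensitivity decision reviewable per route.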

For end-user traffic the gateway typically validates a signed token (JWT/PASETO): check signature, expiry, audience, scopes — locally (no per-request call) using the issuer's public keys (JWKS), refreshed periodically.
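
Local validation can be sketched with the standard library only. The example below assumes an EdDSA-signed compact token with a single string `aud` claim and a public key already cached from the issuer; `sign`, `verify`, `demoToken`, and the `claims` shape are hypothetical, and a production gateway would also handle key IDs, array audiences, and clock skew.

```go
package main

import (
	"crypto/ed25519"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"
)

type claims struct {
	Sub string `json:"sub"`
	Aud string `json:"aud"`
	Exp int64  `json:"exp"`
}

var b64 = base64.RawURLEncoding

// sign builds a compact EdDSA token. Only the issuer does this in practice;
// it is here so the sketch is self-contained.
func sign(priv ed25519.PrivateKey, c claims) string {
	h := b64.EncodeToString([]byte(`{"alg":"EdDSA","typ":"JWT"}`))
	body, _ := json.Marshal(c)
	p := b64.EncodeToString(body)
	sig := ed25519.Sign(priv, []byte(h+"."+p))
	return h + "." + p + "." + b64.EncodeToString(sig)
}

// verify checks algorithm, signature, expiry, and audience with no network call.
func verify(token string, pub ed25519.PublicKey, aud string, now time.Time) (claims, error) {
	var c claims
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return c, errors.New("malformed token")
	}
	var hdr struct {
		Alg string `json:"alg"`
	}
	hb, err := b64.DecodeString(parts[0])
	if err != nil || json.Unmarshal(hb, &hdr) != nil || hdr.Alg != "EdDSA" {
		return c, errors.New("unsupported or malformed header")
	}
	sig, err := b64.DecodeString(parts[2])
	if err != nil || !ed25519.Verify(pub, []byte(parts[0]+"."+parts[1]), sig) {
		return c, errors.New("bad signature")
	}
	body, err := b64.DecodeString(parts[1])
	if err != nil || json.Unmarshal(body, &c) != nil {
		return c, errors.New("malformed claims")
	}
	if now.Unix() >= c.Exp {
		return c, errors.New("token expired")
	}
	if c.Aud != aud {
		return c, errors.New("wrong audience")
	}
	return c, nil
}

// demoToken issues a one-hour token for audience "gateway" with a fresh key.
func demoToken() (string, ed25519.PublicKey, time.Time) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		panic(err)
	}
	now := time.Now()
	return sign(priv, claims{Sub: "user-123", Aud: "gateway", Exp: now.Add(time.Hour).Unix()}), pub, now
}

func main() {
	tok, pub, now := demoToken()
	c, err := verify(tok, pub, "gateway", now)
	fmt.Println(c.Sub, err)
}
```

The signature is checked before the claims are trusted, and the algorithm is pinned rather than read from the token and obeyed.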

4. Core terminology

Term	Definition
Zero-trust	Never trust the network; authenticate every connection, authorize every request.
TLS termination	Decrypting inbound TLS at the gateway to inspect/route.
mTLS	Both peers present certificates; mutual cryptographic authentication.
SNI	Server Name Indication: the hostname in ClientHello; selects the cert and enables passthrough-by-SNI.
ALPN	Protocol negotiation (`h2`/`http/1.1`/`h3`) within the TLS handshake.
SPIFFE ID	A URI naming a workload identity, carried in the cert SAN.
SVID	The short-lived X.509 cert carrying a SPIFFE ID.
SPIRE / Metatron	Identity issuers (open-source / Netflix) that attest workloads and rotate SVIDs.
SDS	Secret Discovery Service: pushes certs/keys to the data plane (Envoy/xDS, gw-08).
JWT / JWKS	Signed token / the issuer's public-key set for local validation.
Cert rotation	Replacing certs (often hourly) without dropping connections.
Fail-open / fail-closed	On authz outage: allow (available) vs deny (secure). A sensitivity-driven choice.
Trust domain	The boundary of a single identity authority (one CA root).

5. Mental models

mTLS is showing ID at both ends, not just the bouncer showing theirs. Regular TLS: the club proves it's the real club (server cert), you walk in anonymously. mTLS: you also show ID, so the club knows exactly who you are and can apply your membership tier (authz).
A SPIFFE ID is a passport; an SVID is the visa stamped in it. The identity is durable (spiffe://netflix/studio/asset-service); the document proving it (the cert) is short-lived and reissued constantly — so a stolen one expires fast, and revocation is "wait a few hours" instead of a fragile CRL/OCSP dance.
Short-lived certs trade a revocation problem for a rotation problem. Long certs need revocation infrastructure (CRL/OCSP, often broken); short certs sidestep it but demand rock-solid automated rotation. You're choosing which hard problem to own — and rotation is the more tractable one.
Fail-open vs fail-closed is a thermostat with a stuck sensor. If the sensor (authz service) dies, do you leave the heat on (fail-open, comfortable but maybe unsafe) or shut it off (fail-closed, safe but cold)? For a public catalog page you might fail-open; for pre-release studio assets you fail-closed. Same mechanism, opposite default by sensitivity.

6. Common misconceptions

"TLS means we're secure." TLS authenticates the server and encrypts the channel; it says nothing about who the client is or what they're allowed to do. Zero-trust needs mTLS (client identity)
plus authorization (policy). Encryption ≠ authentication ≠ authorization.
"mTLS is too expensive at scale." The handshake cost is real (asymmetric crypto), which is exactly why connection reuse and pooling (gw-04) matter — amortize the handshake over many requests. Session resumption and h2 multiplexing further reduce it. Churn is the enemy of cheap mTLS.
"Long-lived certs are simpler." They move the cost to revocation (which usually doesn't work well in practice) and make a leaked key a long-lived disaster. Short-lived auto-rotated certs are the modern, more reliable default.
"Validate the JWT on every request by calling the auth service." That couples every request to the auth service's availability and latency. Validate signatures locally against cached JWKS; only the rare key-rotation needs a fetch.
"Fail-open is always safer for availability." It can be a security incident (serving unauthorized access to sensitive data). The right default depends on what's behind the route; treating it as a blanket rule is the mistake.

7. Interview talking points

"How would you secure service-to-service traffic at the edge?" mTLS with short-lived, auto-rotated workload certs (SPIFFE/SVID; Metatron-style bootstrap at deploy time), identity from the cert SAN, per-route authorization, all enforced at the gateway so backends inherit it. Name the rotation + hot-reload requirement.
"Walk me through the TLS handshake, and where mTLS changes it." ClientHello (SNI/ALPN/key_share) → ServerHello + cert → (mTLS: CertificateRequest → client Certificate + CertificateVerify) → Finished → 1-RTT data. The client's CertificateVerify signature is the proof it owns the key. SNI selects the server cert and enables L4 passthrough routing.
"How do you rotate certs without dropping connections?" Fetch ahead of expiry (SDS/rotation agent), atomic-swap via the GetCertificate callback so new handshakes use the new cert while existing connections keep theirs — same no-dropped-request pattern as route hot-reload (gw-03). Never block the event loop on the fetch.
"Fail-open or fail-closed if the authz service is down?" It depends on the data's sensitivity: fail-closed for studio/pre-release assets (security > availability), consider fail-open with degraded scope for low-sensitivity public traffic. Mitigate the dependency with local token validation + cached decisions so the question rarely arises. This trade-off articulation is the senior signal.
"Why short-lived certs?" They make revocation a non-problem (a leaked cert expires in hours) at the cost of demanding automated rotation — the more reliable trade. And they fit zero-trust: identity is continuously re-proven.
"How does mTLS interact with connection churn?" Every new connection pays a full asymmetric handshake; mTLS doubles the cert work. So mTLS makes gw-04's connection reuse a security-cost optimization, not just latency — a nice cross-lab connection to draw.

8. Connections to other labs

gw-01 (L4) — SNI-based TLS passthrough keeps you at L4 (and keeps splice); the handshake cost is the socket-level reason churn hurts.
gw-02 (L7) — ALPN negotiates h2/h3 in the handshake; termination is the prerequisite for L7 inspection.
gw-03 (API gateway) — authn/authz is the inbound filter; cert hot-reload is the same atomic-swap pattern as route reload.
gw-04 (connection management) — connection reuse amortizes the (expensive, doubled-for-mTLS) handshake.
gw-08 (Envoy/xDS) — SDS pushes certs to the data plane; the control plane distributes identity and policy.
gw-09 (Kubernetes) — pod identity, service-mesh mTLS, and secret distribution all live here; SPIRE runs as a node agent + workload API.

gw-07 — The Hitchhiker's Guide to Edge Security (mTLS & Zero-Trust)

Companion to CONCEPTS.md, with the runnable mTLS gateway in src/go/mtls/. Backed by the "Securing Netflix Studios at Scale" talk: high-value content traffic protected by strong identity and policy at the edge.

Zero-trust replaces "trust the network" with "trust nothing; verify everything." Concretely at the edge: every connection proves who it is with a certificate (mTLS), and every request is authorized against a policy. This lab builds all of it from crypto/tls and crypto/x509, including a tiny CA so the whole thing runs offline.

1. Identity: a CA and SPIFFE SVIDs (certs.go)

In production, identity comes from SPIRE or Netflix's Metatron, which attest a workload and issue it a short-lived certificate (an SVID) whose subject is its identity. We model that with a CA that issues leaves carrying a SPIFFE ID in the certificate's URI SAN:

cert, _ := ca.Issue("asset-service", "spiffe://netflix/studio/asset-service", false, time.Hour)

TestIssueAndExtractSpiffe issues such a cert and recovers the identity with SpiffeIDFromState. The key facts:

The identity lives in the URI SAN (spiffe://trust-domain/path), not the CN — that's the SPIFFE convention.
The cert is short-lived (ttl is small on purpose). Short certs make revocation a non-problem: a leaked cert expires in hours, so you trade a hard revocation problem (CRL/OCSP, which rarely works well) for a tractable rotation problem (§3). This is the modern, more reliable default — say that in an interview.
ExtKeyUsage differentiates server (ServerAuth) from client (ClientAuth) certs; the same CA issues both.
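To make the key facts above concrete, here is a minimal stdlib sketch of issuing a short-lived client leaf with a SPIFFE ID in the URI SAN. The names issueSVID and issuedIdentity are hypothetical and are not the lab's ca.Issue; the CA here is just a throwaway in-memory key and an issuer template, which is an assumption made for brevity.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"net/url"
	"time"
)

// issueSVID signs a short-lived client leaf whose URI SAN carries spiffeID,
// using a throwaway in-memory CA key. Hypothetical helper for illustration.
func issueSVID(spiffeID string, ttl time.Duration) (*x509.Certificate, error) {
	id, err := url.Parse(spiffeID)
	if err != nil {
		return nil, err
	}
	caKey, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	leafKey, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	now := time.Now()
	// Issuer template: its Subject becomes the leaf's issuer name.
	ca := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "lab-ca"},
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	leaf := &x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      pkix.Name{CommonName: "asset-service"},
		URIs:         []*url.URL{id}, // the identity lives here, not in the CN
		NotBefore:    now.Add(-time.Minute),
		NotAfter:     now.Add(ttl), // short on purpose
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, leaf, ca, &leafKey.PublicKey, caKey)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// issuedIdentity issues one cert and reads the identity back out of the URI SAN.
func issuedIdentity() (string, error) {
	cert, err := issueSVID("spiffe://netflix/studio/asset-service", time.Hour)
	if err != nil {
		return "", err
	}
	return cert.URIs[0].String(), nil
}

func main() {
	id, err := issuedIdentity()
	if err != nil {
		panic(err)
	}
	fmt.Println("URI SAN:", id)
}
```

Swapping ExtKeyUsageClientAuth for ExtKeyUsageServerAuth (and adding a DNS SAN) is what would turn the same template into a server leaf.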

2. Mutual TLS (server.go, identity.go)

ServerTLSConfig is the heart:

&tls.Config{
    MinVersion:     tls.VersionTLS13,
    GetCertificate: holder.get,                  // hot rotation (§3)
    ClientAuth:     tls.RequireAndVerifyClientCert, // mTLS: DEMAND a verified client cert
    ClientCAs:      clientCAs,
    NextProtos:     []string{"h2", "http/1.1"},  // ALPN
}

RequireAndVerifyClientCert is what makes it mutual: the server won't complete the handshake unless the client presents a certificate chaining to a trusted CA. TestMTLSEndToEnd proves all three outcomes against a real TLS listener:

authorized identity + allowed path → 200 (and the handler sees the identity in X-Spiffe-Id),
valid cert but wrong identity for the route → 403,
no client cert → the handshake itself fails (the request never reaches the handler).

That last one is the zero-trust property in one assertion: an unauthenticated client cannot even open a connection, let alone send a request.

Authorization and the trust boundary

Authorize middleware pulls the identity from r.TLS — the cryptographically verified peer certificate — and checks a Policy (longest-prefix path → allowed SPIFFE IDs, default-deny). TestPolicyAllow covers allow/deny/wildcard/default-deny.

The single most important security rule in this whole phase: identity comes from the verified cert, never from a client-controlled header. A handler that trusts an inbound X-User or X-Spiffe-Id from the client is trivially spoofable. We set X-Spiffe-Id after verifying, to pass identity to trusted downstreams — but we'd strip any inbound copy first. This is the exact same trust boundary as the PROXY protocol (gw-01) and X-Forwarded-For (gw-03).
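The strip-then-set rule can be sketched as a small middleware. This is an illustration under assumptions, not the lab's Authorize: withVerifiedIdentity and seenDownstream are hypothetical names, and the helper fakes the TLS state with an in-memory certificate instead of running a real handshake.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

const identityHeader = "X-Spiffe-Id"

// withVerifiedIdentity drops any client-supplied identity header, then sets it
// from the peer certificate before calling next.
func withVerifiedIdentity(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		r.Header.Del(identityHeader) // never trust the inbound copy
		if r.TLS != nil && len(r.TLS.PeerCertificates) > 0 {
			for _, u := range r.TLS.PeerCertificates[0].URIs {
				if u.Scheme == "spiffe" {
					r.Header.Set(identityHeader, u.String())
					break
				}
			}
		}
		next.ServeHTTP(w, r)
	})
}

// seenDownstream simulates a request carrying a spoofed header plus (optionally)
// a peer certificate with certID, and returns what the next handler sees.
func seenDownstream(spoofed, certID string) string {
	var seen string
	h := withVerifiedIdentity(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		seen = r.Header.Get(identityHeader)
	}))
	r := httptest.NewRequest(http.MethodGet, "/studio/assets", nil)
	r.Header.Set(identityHeader, spoofed)
	if certID != "" {
		u, err := url.Parse(certID)
		if err != nil {
			panic(err)
		}
		// Stand-in for the state a real mTLS handshake would populate.
		r.TLS = &tls.ConnectionState{
			PeerCertificates: []*x509.Certificate{{URIs: []*url.URL{u}}},
		}
	}
	h.ServeHTTP(httptest.NewRecorder(), r)
	return seen
}

func main() {
	// The spoofed value is replaced by the certificate's identity.
	fmt.Println(seenDownstream("spiffe://netflix/admin", "spiffe://netflix/studio/asset-service"))
	// With no peer certificate the spoofed header is removed and nothing is set.
	fmt.Printf("%q\n", seenDownstream("spiffe://netflix/admin", ""))
}
```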

The fail-open vs fail-closed decision (what to do if an external authz dependency is down) is a per-route judgment call by data sensitivity: fail-closed for pre-release studio assets, perhaps fail-open with degraded scope for a public catalog page. Articulating that trade-off is the senior signal; the lab keeps authz local (no external dependency) so the question is yours to design (exercise §6.3).
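One possible shape for that per-route decision is sketched below. Everything in it is an assumption for illustration: the routes table, the 50 ms timeout, the exact-match path lookup, and the authzDown/authzDeny stand-ins are hypothetical and not part of the lab.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type failMode int

const (
	failClosed failMode = iota // dependency down => deny
	failOpen                   // dependency down => allow (ideally with reduced scope)
)

// authzFunc stands in for a call to an external authorization service.
type authzFunc func(ctx context.Context, id, path string) (bool, error)

type routePolicy struct {
	mode    failMode
	timeout time.Duration
}

// routes is a hypothetical per-route table; unknown paths fail closed.
var routes = map[string]routePolicy{
	"/studio/assets": {failClosed, 50 * time.Millisecond},
	"/catalog":       {failOpen, 50 * time.Millisecond},
}

// allow asks authz under a deadline. A definite answer always wins; only an
// error or a timeout falls back to the route's fail mode.
func allow(authz authzFunc, id, path string) bool {
	p, ok := routes[path]
	if !ok {
		p = routePolicy{failClosed, 50 * time.Millisecond}
	}
	ctx, cancel := context.WithTimeout(context.Background(), p.timeout)
	defer cancel()

	type result struct {
		ok  bool
		err error
	}
	ch := make(chan result, 1) // buffered so a late reply never blocks the goroutine
	go func() {
		ok, err := authz(ctx, id, path)
		ch <- result{ok, err}
	}()
	select {
	case r := <-ch:
		if r.err == nil {
			return r.ok
		}
	case <-ctx.Done():
	}
	return p.mode == failOpen
}

var (
	authzDeny authzFunc = func(context.Context, string, string) (bool, error) { return false, nil }
	authzDown authzFunc = func(context.Context, string, string) (bool, error) {
		return false, errors.New("authz unavailable")
	}
)

func main() {
	fmt.Println(allow(authzDown, "spiffe://netflix/web", "/studio/assets")) // false: fail-closed
	fmt.Println(allow(authzDown, "spiffe://netflix/web", "/catalog"))       // true: fail-open
	fmt.Println(allow(authzDeny, "spiffe://netflix/web", "/catalog"))       // false: explicit deny wins
}
```

The point worth noticing is the third case: fail-open only applies when the dependency gives no answer, never when it answers "deny".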

3. Zero-downtime rotation (server.go)

Short certs mean you rotate constantly — so rotation must never drop a connection. CertHolder stores the active cert in an atomic.Pointer read by the TLS config's GetCertificate callback. A rotation agent calls Set with a fresh cert; new handshakes use the new cert while existing connections are untouched. TestHotRotation confirms the active serial changes on Set; the mtlsgw CLI rotates every 30s live.

This is the same atomic-swap, no-dropped-request pattern as gw-03 route reload and gw-04 membership — once you see it three times, you own it. In Envoy this is SDS (Secret Discovery Service) pushing certs over xDS (gw-08); the mechanism here is the local version of that.

4. Why this ties to connection churn (gw-04)

Every new mTLS connection pays a doubled asymmetric handshake (both sides verify a cert). So mTLS makes gw-04's connection reuse a security-cost optimization, not just a latency one: pooling and h2 multiplexing amortize the expensive handshake over many requests. "mTLS is too expensive at scale" is really "connection churn under mTLS is expensive" — and the fix is gw-04, plus TLS 1.3 session resumption.
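A rough way to see the amortization is to count server-side handshakes with and without keep-alive. The sketch below uses plain TLS from net/http/httptest rather than the lab's mTLS gateway (an assumption to keep it short); countHandshakes is a hypothetical helper, and the handshake count, not the per-handshake CPU cost, is what it measures.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// countHandshakes serves n sequential HTTPS requests and returns how many TLS
// handshakes the server completed.
func countHandshakes(n int, reuse bool) (int64, error) {
	var handshakes atomic.Int64
	srv := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "ok")
	}))
	srv.TLS = &tls.Config{
		// Called once per handshake on the server side.
		VerifyConnection: func(tls.ConnectionState) error {
			handshakes.Add(1)
			return nil
		},
	}
	srv.StartTLS()
	defer srv.Close()

	client := srv.Client() // trusts the test server's certificate
	client.Transport.(*http.Transport).DisableKeepAlives = !reuse

	for i := 0; i < n; i++ {
		resp, err := client.Get(srv.URL)
		if err != nil {
			return 0, err
		}
		io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
		resp.Body.Close()
	}
	return handshakes.Load(), nil
}

func main() {
	pooled, err := countHandshakes(5, true)
	if err != nil {
		panic(err)
	}
	churned, err := countHandshakes(5, false)
	if err != nil {
		panic(err)
	}
	fmt.Printf("5 requests: %d handshake(s) with reuse, %d without\n", pooled, churned)
}
```

With mTLS each of those handshakes also carries a client certificate and its verification, which is why the gap matters more at the edge.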

5. Hands-on

cd src/go
bash ../scripts/verify.sh        # tests -race

go run ./cmd/mtlsgw              # generates a CA + certs, prints a curl command
# in another shell, use the printed command:
curl --cacert /tmp/gw07-ca.crt --cert /tmp/gw07-client.crt --key /tmp/gw07-client.key \
     https://127.0.0.1:8443/studio/assets        # 200, authorized as the SPIFFE id
curl --cacert /tmp/gw07-ca.crt https://127.0.0.1:8443/studio/assets   # TLS error: no client cert
# watch the log: the server cert rotates every 30s with zero dropped connections.

Inspect a cert to see the SPIFFE URI SAN:

openssl x509 -in /tmp/gw07-client.crt -text -noout | grep -A1 'Subject Alternative'

6. Exercises

mTLS to the origin too: have the gateway present its own SVID when dialing the origin (end-to-end identity), and propagate the client identity downstream in request context.
JWT for end users: add local JWT validation (verify signature against cached JWKS, check exp/aud/scopes) for human traffic, with no per-request call to the auth service. Combine with mTLS for service-to-service.
Fail-open vs fail-closed: add an external authz check with a timeout and make the failure policy per-route; demonstrate both behaviors and justify the default for a sensitive vs a public route.
SNI passthrough: implement an L4 path that routes by SNI without terminating TLS (keeping splice, gw-01) for traffic that must stay end-to-end encrypted; contrast with termination's L7 powers.
Rotation under load: drive wrk/curl in a loop against mtlsgw across several rotations and confirm zero failed requests — proving the atomic swap is truly zero-downtime.

gw-07 — References

The Netflix angle (named in the JD)

The Show Must Go On: Securing Netflix Studios at Scale — the talk; high-value content traffic, identity, and policy at scale.
Netflix Metatron — credential bootstrap: inject identity into each microservice at the continuous-delivery phase (the secret-zero solution). Referenced across Netflix security talks/posts.
Zero Configuration Service Mesh with On-Demand Cluster Discovery — Netflix TechBlog; Envoy + mTLS identity in the mesh direction. https://netflixtechblog.com/zero-configuration-service-mesh-with-on-demand-cluster-discovery-ac6483b52a51

Identity standards & implementations

SPIFFE / SPIRE — workload identity, SVIDs, attestation, the Workload API. https://spiffe.io/ · https://github.com/spiffe/spire
Envoy SDS (Secret Discovery Service) — pushing certs/keys to the data plane and rotating them (gw-08). https://www.envoyproxy.io/docs/envoy/latest/configuration/security/secret
cert-manager (Kubernetes) — automated issuance/rotation of certs from an issuer/CA. https://cert-manager.io/

Protocol specs

RFC 8446 — TLS 1.3 (the handshake, 0-RTT, session resumption).
RFC 6066 — TLS extensions incl. SNI. ALPN — RFC 7301.
RFC 5280 — X.509 certificates and path validation; SAN URI for SPIFFE IDs.
RFC 7519 — JWT; RFC 7517 — JWK/JWKS for local validation.

Background

BeyondCorp (Google) papers — the canonical zero-trust write-ups. https://research.google/pubs/?q=beyondcorp
NIST SP 800-207 — Zero Trust Architecture (the reference model).
Cloudflare/Smallstep blogs on short-lived certificates and why revocation (CRL/OCSP) is hard.

Tooling

openssl s_client -connect host:443 -servername host — inspect the handshake, cert chain, SNI, ALPN.
step (smallstep) — create a CA, issue certs, build an mTLS demo fast.
openssl x509 -text / openssl verify — read and validate certs.

Cross-lab dependencies

Upstream: gw-01 (handshake cost / SNI passthrough), gw-03 (inbound filter + hot-reload pattern), gw-04 (amortize the handshake).
Downstream: gw-08 (SDS), gw-09 (pod identity / mesh mTLS).

gw-07 — Analysis

The design review for the mTLS gateway you build in steps/.

Required behaviors

Mutual authentication. The gateway requires a client certificate and verifies that it chains to a trusted CA; an untrusted or expired client cert is rejected at the handshake.
Identity extraction. The workload identity (SPIFFE ID from the SAN, or CN) is extracted and made available to authz as the authenticated principal.
Per-route authorization. Given the identity, a policy decides allow/deny per route + method; denials are 403, not 500.
Hot cert rotation. New certs are picked up for new handshakes via atomic swap, with zero dropped connections and no event-loop block.
Explicit failure policy. The behavior when the authz dependency is unavailable is configured per route (fail-open vs fail-closed), not accidental.

Design decisions

tls.Config.GetCertificate for hot rotation. The server cert is read from an atomic.Pointer[tls.Certificate]; a rotation goroutine swaps it. New handshakes get the new cert; in-flight connections are untouched. Same pattern as gw-03 route reload and gw-04 membership.
VerifyPeerCertificate for identity + policy. Beyond chain validation, a custom verifier extracts the SPIFFE ID from the SAN and rejects identities outside the trust domain early (before any request work).
Authz as an interface with a local fast path. End-user tokens are validated locally against cached JWKS (no per-request network call); service identities come from the cert. Only coarse, cacheable policy decisions may call an external service, always with a timeout and a per-route fail policy.
Trust boundary made explicit. The gateway only trusts client identity from a verified cert (or a token it validated itself) — never a plaintext header like X-User from the client, and never a PROXY header (gw-01) or X-Forwarded-For from an untrusted peer.
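The VerifyPeerCertificate decision above could look roughly like the following. It is a sketch under assumptions: checkTrustDomain, verifyPeer, and acceptsID are hypothetical names, and the "exactly one URI SAN" check follows the SPIFFE X.509-SVID convention rather than anything the lab states.

```go
package main

import (
	"crypto/x509"
	"errors"
	"fmt"
	"net/url"
)

// checkTrustDomain accepts an ID only if it uses the spiffe scheme, belongs to
// trustDomain, and names a workload path.
func checkTrustDomain(id *url.URL, trustDomain string) error {
	switch {
	case id.Scheme != "spiffe":
		return fmt.Errorf("not a SPIFFE ID: %s", id)
	case id.Host != trustDomain:
		return fmt.Errorf("trust domain %q is not %q", id.Host, trustDomain)
	case id.Path == "" || id.Path == "/":
		return errors.New("SPIFFE ID has no workload path")
	}
	return nil
}

// verifyPeer returns a callback with the signature of
// tls.Config.VerifyPeerCertificate. With RequireAndVerifyClientCert it runs
// after chain validation, so verifiedChains[0][0] is the verified leaf.
func verifyPeer(trustDomain string) func([][]byte, [][]*x509.Certificate) error {
	return func(_ [][]byte, verifiedChains [][]*x509.Certificate) error {
		if len(verifiedChains) == 0 || len(verifiedChains[0]) == 0 {
			return errors.New("no verified chain")
		}
		leaf := verifiedChains[0][0]
		if len(leaf.URIs) != 1 {
			return errors.New("expected exactly one URI SAN")
		}
		return checkTrustDomain(leaf.URIs[0], trustDomain)
	}
}

// acceptsID runs the callback against an in-memory leaf carrying raw as its
// URI SAN (a stand-in for a chain produced by a real handshake).
func acceptsID(raw, trustDomain string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}
	chain := [][]*x509.Certificate{{{URIs: []*url.URL{u}}}}
	return verifyPeer(trustDomain)(nil, chain) == nil
}

func main() {
	fmt.Println(acceptsID("spiffe://netflix/studio/asset-service", "netflix")) // true
	fmt.Println(acceptsID("spiffe://other/studio/asset-service", "netflix"))   // false
	fmt.Println(acceptsID("https://netflix/studio/asset-service", "netflix"))  // false
}
```

One caveat from the crypto/tls documentation: VerifyPeerCertificate is not invoked on resumed connections, so a check that must run on every connection belongs in VerifyConnection instead.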

Tradeoffs worth flagging

mTLS handshake cost vs reuse. Doubling the cert work per handshake makes connection reuse (gw-04) a security-cost optimization. Quantify: at N new connections/sec, mTLS handshakes are a measurable CPU line item; pooling + resumption + h2 multiplexing are the levers.
Short-lived certs vs rotation reliability. Hourly certs make revocation a non-issue but make the rotation pipeline a Tier-0 dependency: if rotation breaks, everything expires together. Monitor cert age and alert well before expiry; have a break-glass longer-lived cert.
Fail-open vs fail-closed. A single global default is wrong; it must be per-route by sensitivity. The cost of getting it wrong is either an outage (fail-closed on a flaky authz dep) or a breach (fail-open on sensitive data). Document the decision per route.
Termination vs passthrough. Terminating TLS enables L7 policy but forfeits splice and means the gateway holds private keys (a juicy target). SNI passthrough keeps keys at the origin and stays L4 but gives up L7 inspection. Choose per traffic class.
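For the "monitor cert age and alert well before expiry" point, a minimal sketch is to alert on the fraction of lifetime remaining rather than on an absolute time, so the same rule works for hourly and daily certs. The helper names and the idea of a fractional threshold are assumptions for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// remainingFraction reports how much of a certificate's lifetime is left at
// now, clamped to the range 0 to 1.
func remainingFraction(notBefore, notAfter, now time.Time) float64 {
	total := notAfter.Sub(notBefore)
	if total <= 0 || !now.Before(notAfter) {
		return 0
	}
	if now.Before(notBefore) {
		return 1
	}
	return float64(notAfter.Sub(now)) / float64(total)
}

// needsAlert is a hypothetical convenience wrapper: a cert with a lifetime of
// ttlMin minutes that is ageMin minutes old alerts once less than threshold
// of its lifetime remains.
func needsAlert(ttlMin, ageMin int, threshold float64) bool {
	issued := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	expires := issued.Add(time.Duration(ttlMin) * time.Minute)
	now := issued.Add(time.Duration(ageMin) * time.Minute)
	return remainingFraction(issued, expires, now) < threshold
}

func main() {
	// Hourly cert, alert when under a third of the lifetime is left.
	fmt.Println(needsAlert(60, 30, 1.0/3)) // false: half the lifetime remains
	fmt.Println(needsAlert(60, 45, 1.0/3)) // true: a quarter remains
	fmt.Println(needsAlert(60, 90, 1.0/3)) // true: already expired
}
```

In a real gateway the inputs would come from the NotBefore and NotAfter fields of the currently served leaf.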

What production adds beyond this lab

A real issuer (SPIRE/Metatron/cert-manager) with workload attestation and the Workload API / SDS push to the data plane (gw-08).
Certificate transparency / inventory and expiry alerting across the fleet.
mTLS to the origin as well as from the client (end-to-end identity), and identity propagation in request context for downstream authz.
A policy engine (OPA/Rego or an internal equivalent) with versioned, reviewed, hot-reloadable policy — and audit logging of authz decisions (gw-11).
DDoS / WAF / bot-management stages and TLS fingerprinting at the edge.

gw-07 — Execution

Prerequisites

Go ≥ 1.25 (stdlib crypto only, offline). Optional: curl, openssl.

One-shot

cd gw-07-edge-security-mtls && bash scripts/verify.sh   # → "=== gw-07 OK ==="

Per-language workflow (Go)

cd gw-07-edge-security-mtls/src/go
go test -race -count=1 ./...      # CA/issue, policy, rotation, mTLS end-to-end
go run ./cmd/mtlsgw               # self-contained mTLS gateway demo

Run the demo

mtlsgw generates an ephemeral CA + server cert + client cert, writes the client material to /tmp, serves an mTLS endpoint with SPIFFE authz, and rotates the server cert every 30s. It prints the exact curl to run:

curl --cacert /tmp/gw07-ca.crt --cert /tmp/gw07-client.crt --key /tmp/gw07-client.key \
     https://127.0.0.1:8443/studio/assets    # 200
curl --cacert /tmp/gw07-ca.crt https://127.0.0.1:8443/studio/assets   # handshake refused

Package map

File	What
`mtls/certs.go`	a tiny CA; issue leaves with SPIFFE URI SANs; PEM encode
`mtls/server.go`	hot-rotating server TLS config (mTLS required), client config
`mtls/identity.go`	SPIFFE extraction from the verified cert; policy + authz middleware
`cmd/mtlsgw`	runnable mTLS gateway with live cert rotation

See GUIDE.md for the deep dive (trust boundary, fail-open vs fail-closed, churn link).

gw-07 — Verification

One command

cd gw-07-edge-security-mtls && bash scripts/verify.sh

What the tests prove

Test	Invariant
`TestIssueAndExtractSpiffe`	a leaf carries its SPIFFE ID in the URI SAN and it round-trips out via `SpiffeIDFromState`
`TestPolicyAllow`	longest-prefix authz: allow / deny-wrong-identity / wildcard / default-deny
`TestHotRotation`	`CertHolder.Set` changes the active certificate (the rotation mechanism)
`TestMTLSEndToEnd`	over a real TLS listener: authorized→200, wrong identity→403, no client cert→handshake fails

All under -race.

What "green" does NOT guarantee

Lab CA, not a real issuer. Production uses SPIRE/Metatron with workload attestation + SDS push (gw-08).
No origin-side mTLS / JWT. End-to-end identity and end-user token validation are exercises (GUIDE §6.1/§6.2).
Authz is local. The fail-open/fail-closed decision for an external authz dependency is an exercise (GUIDE §6.3) — and a key design call.
Termination only. SNI passthrough (keeping splice, gw-01) is an exercise (GUIDE §6.4).

Manual check

go run ./cmd/mtlsgw   # then run the printed curl; confirm 200 with cert,
                      # handshake error without; watch the rotation log.
openssl x509 -in /tmp/gw07-client.crt -text -noout | grep -A1 'Alternative'

gw-07 step 01 — mTLS, identity extraction, and authorization

Goal

Stand up mutual TLS between a client and the gateway, extract the client's SPIFFE identity from its certificate, enforce a per-route authorization policy, and hot-rotate the server cert with zero dropped connections.

Set up a CA and certs (smallstep `step` makes this 2 minutes)

step certificate create "Edge CA" ca.crt ca.key \
  --profile root-ca --no-password --insecure

# Gateway server cert (hostname in SAN):
step certificate create gateway.local gw.crt gw.key \
  --profile leaf --ca ca.crt --ca-key ca.key --no-password --insecure \
  --san gateway.local

# Client workload cert with a SPIFFE ID in the URI SAN:
step certificate create asset-service client.crt client.key \
  --profile leaf --ca ca.crt --ca-key ca.key --no-password --insecure \
  --san spiffe://netflix/studio/asset-service

Code — `src/go/mtls.go`

package mtls

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"net/http"
	"os"
	"sync/atomic"
)

// certHolder enables hot rotation: GetCertificate reads an atomically
// swappable cert so new handshakes use the latest without dropping
// existing connections.
type certHolder struct{ cur atomic.Pointer[tls.Certificate] }

func (h *certHolder) get(*tls.ClientHelloInfo) (*tls.Certificate, error) {
	return h.cur.Load(), nil
}

func (h *certHolder) load(certFile, keyFile string) error {
	c, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return err
	}
	h.cur.Store(&c)
	return nil
}

// NewServerTLS builds a mutual-TLS config: requires + verifies a client
// cert chained to caFile, and hot-rotates the server cert.
func NewServerTLS(certFile, keyFile, caFile string) (*tls.Config, *certHolder, error) {
	caPEM, err := os.ReadFile(caFile)
	if err != nil {
		return nil, nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, nil, errors.New("bad CA pem")
	}
	h := &certHolder{}
	if err := h.load(certFile, keyFile); err != nil {
		return nil, nil, err
	}
	cfg := &tls.Config{
		MinVersion:     tls.VersionTLS13,
		GetCertificate: h.get,                       // hot rotation hook
		ClientAuth:     tls.RequireAndVerifyClientCert, // mTLS: demand a client cert
		ClientCAs:      pool,
		NextProtos:     []string{"h2", "http/1.1"},  // ALPN
	}
	return cfg, h, nil
}

// SpiffeID pulls the SPIFFE URI SAN from the verified client cert.
func SpiffeID(r *http.Request) (string, bool) {
	if r.TLS == nil || len(r.TLS.PeerCertificates) == 0 {
		return "", false
	}
	for _, u := range r.TLS.PeerCertificates[0].URIs {
		if u.Scheme == "spiffe" {
			return u.String(), true
		}
	}
	return "", false
}

Code — authorization middleware (the inbound security filter)

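package mtls

import (
	"net/http"
	"slices"
	"strings"
)

// Policy: which SPIFFE identities may hit which path prefixes.
var policy = map[string][]string{
	"/studio/assets": {"spiffe://netflix/studio/asset-service"},
	"/healthz":       {"*"}, // open to any authenticated identity
}

// policyFor returns the allow-list of the longest matching path prefix.
// No match returns nil, which denies by default.
func policyFor(path string) []string {
	best := ""
	for prefix := range policy {
		if strings.HasPrefix(path, prefix) && len(prefix) > len(best) {
			best = prefix
		}
	}
	return policy[best]
}

func Authorize(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id, ok := SpiffeID(r) // identity from the verified cert, NOT a header
		if !ok {
			http.Error(w, "no client identity", http.StatusUnauthorized) // 401
			return
		}
		allowed := policyFor(r.URL.Path)
		if !slices.Contains(allowed, id) && !slices.Contains(allowed, "*") {
			http.Error(w, "forbidden for "+id, http.StatusForbidden) // 403
			return
		}
		next.ServeHTTP(w, r)
	})
}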

Run + rotate

cfg, holder, err := NewServerTLS("gw.crt", "gw.key", "ca.crt")
if err != nil {
	log.Fatal(err)
}
srv := &http.Server{Addr: ":8443", TLSConfig: cfg, Handler: Authorize(routes)}

// Rotation goroutine: reload the cert periodically (or on SDS push).
go func() {
	for range time.Tick(1 * time.Hour) {
		// A failed reload keeps the previous cert; a successful one is
		// picked up atomically by new handshakes.
		if err := holder.load("gw.crt", "gw.key"); err != nil {
			log.Printf("cert reload failed, keeping current cert: %v", err)
		}
	}
}()
log.Fatal(srv.ListenAndServeTLS("", "")) // certs come from TLSConfig

Test it:

# With the right client cert + identity -> 200:
curl --cacert ca.crt --cert client.crt --key client.key \
     --resolve gateway.local:8443:127.0.0.1 https://gateway.local:8443/studio/assets

# Without a client cert -> handshake fails (mTLS required):
curl --cacert ca.crt --resolve gateway.local:8443:127.0.0.1 \
     https://gateway.local:8443/studio/assets   # TLS error

# Right cert, wrong route for that identity -> 403.

Tasks

Build the mTLS server; confirm a valid client cert reaches /studio/assets, a missing client cert fails the handshake, and a valid-but-unauthorized identity gets 403.
Hot-rotate: while wrk/curl loops against the gateway, reissue gw.crt and call holder.load; confirm new connections use the new cert and no in-flight request is dropped.
Add the fail policy: make Authorize call a mock external authz that you can turn off; implement per-route fail-open vs fail-closed and demonstrate both.

Acceptance

mTLS enforced (no client cert ⇒ no connection); identity extracted from the cert SAN; per-route authz returns 401/403 correctly.
Cert rotation with zero dropped requests.
A working, configurable fail-open/fail-closed behavior with a clear per-route rationale.

Discussion prompts

Why extract identity from the verified cert and never from a request header? (A header is client-controlled; the cert is cryptographically proven. Ties to the gw-01 PROXY-protocol / gw-03 XFF trust boundary.)
Why does RequireAndVerifyClientCert reject at the handshake rather than returning a 401? What does that save you?
Each new mTLS connection pays a doubled asymmetric handshake. Connect this to gw-04: how much CPU does connection reuse save here, and how would you measure it?

gw-08 — Envoy & the xDS Control Plane

The JD lists "Envoy Gateway, Kubernetes Gateway API, Istio Gateway" and, crucially, splits the work into "data plane and control plane." This lab is where that split becomes concrete. Envoy is the modern canonical data plane — a high-performance C++ L4/L7 proxy. By itself it proxies whatever its static config says; its power comes from being dynamically configured by a control plane over the xDS protocol. The control plane is the source of truth (which listeners, routes, clusters, endpoints, secrets exist); the data plane is a fleet of Envoys that fetch and apply that config in real time.

This is the same data-plane/control-plane architecture as Netflix's move toward a service mesh ("Zero Configuration Service Mesh with On-Demand Cluster Discovery"), and it's the design pattern behind the Gateway API operator in gw-10. Understanding xDS is understanding how a gateway fleet is operated at scale: you don't redeploy 80 clusters to change a route — the control plane pushes it.

You will build a minimal xDS control plane in Go using go-control-plane and drive a real Envoy with it, watching config flow from your server into the live proxy.

1. What is it?

Envoy is a self-contained proxy with a clean internal model:

Listener  — a port Envoy listens on (e.g. :443)
  └─ Filter chain — ordered network/HTTP filters (the gw-03 idea, in C++)
       └─ HTTP Connection Manager — terminates L7, runs HTTP filters
            └─ Route config — match request → route → cluster
                 └─ Cluster — a logical upstream service (a backend)
                      └─ Endpoints — the concrete instances (ip:port) of that cluster

Each layer is configurable dynamically via a corresponding xDS ("x" Discovery Service) API:

API	Discovers	Maps to
LDS	Listeners	ports + filter chains
RDS	Routes	the route table (gw-03)
CDS	Clusters	upstream services
EDS	Endpoints	the instances of a cluster (gw-04 membership!)
SDS	Secrets	TLS certs/keys (gw-07)

xDS is the gRPC (or REST) protocol Envoy uses to subscribe to these resources from a control plane. The control plane computes the desired state and streams updates; Envoy applies them without a restart. ADS (Aggregated Discovery Service) multiplexes all of them over one stream for ordering guarantees; Delta xDS sends only what changed.

2. Why does it matter?

The JD explicitly wants data-plane + control-plane fluency. This lab is that split. In a design interview, leading with "I'd separate a p99-optimized data plane from a correctness-optimized control plane that pushes config via something xDS-shaped" is the strongest possible opening (gw-00 INTERVIEW.md).
It's how you operate a fleet without redeploys. Routes, clusters, endpoints, certs, and resilience policy all change via control-plane push. Connection subsetting (gw-04) consumes EDS membership; cert rotation (gw-07) consumes SDS; canary routing (gw-12) is an RDS change. The control plane is the operational nerve center.
Config propagation is a distributed-consistency problem. Pushing config to a fleet has the same hazards as replicating a log (db-16…20): ordering (CDS must arrive before the EDS that references it, or Envoy drops endpoints — that's why ADS exists), partial rollout (some Envoys updated, some not), and thundering herds (a change that invalidates everything at once). Your consensus intuition transfers directly.
Envoy is the industry data plane. Istio, Gateway API implementations (Contour, Envoy Gateway), Consul, and many clouds run Envoy under the hood. Reading an Envoy config and reasoning about a filter's buffering/lifecycle is a baseline skill (the JD says you should be able to read C++).

3. How does it work?

The data-plane/control-plane loop

        ┌──────────────── CONTROL PLANE ────────────────┐
        │ source of truth (k8s CRDs / service discovery) │
        │            ↓ reconcile                          │
        │   desired state → SnapshotCache (versioned)     │
        │            ↓ xDS (gRPC stream, per-node)         │
        └───────────────────┬────────────────────────────┘
                            │  LDS/RDS/CDS/EDS/SDS
        ┌───────────────────▼────────────────────────────┐
   ───▶ │  Envoy fleet (DATA PLANE): apply config live     │ ───▶ origins
 client └──────────────────────────────────────────────────┘

The control plane watches a source of truth (Kubernetes objects in gw-10, or a service registry).
It computes a snapshot of desired resources (listeners, routes, clusters, endpoints) with a version.
Each Envoy opens a gRPC stream and subscribes; the control plane sends the resources for that Envoy's node ID.
Envoy ACKs the version it applied (or NACKs with an error if the config is invalid — a critical feedback signal).
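On any change, the control plane bumps the version and pushes the update (the full resource set for the affected type in state-of-the-world mode, or only the changes in delta xDS); Envoy applies it without dropping connections.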
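Steps 2 and 5 hinge on a version that changes exactly when the desired state changes. One common way to get that, assumed here rather than taken from the lab's code, is to derive the version from a hash of the snapshot's content. Snapshot and version are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// Snapshot is a hypothetical desired-state record: resource name -> serialized config.
type Snapshot map[string]string

// version returns a deterministic 12-hex-digit digest of the snapshot, so equal
// content yields the same version regardless of map iteration order.
func version(s Snapshot) string {
	names := make([]string, 0, len(s))
	for n := range s {
		names = append(names, n)
	}
	sort.Strings(names)

	h := sha256.New()
	for _, n := range names {
		// Length prefixes keep ("ab","c") and ("a","bc") from colliding.
		fmt.Fprintf(h, "%d:%s%d:%s", len(n), n, len(s[n]), s[n])
	}
	return hex.EncodeToString(h.Sum(nil))[:12]
}

func main() {
	v1 := version(Snapshot{"cluster/playback": "eds", "route/api": "/v1/play"})
	v2 := version(Snapshot{"cluster/playback": "eds", "route/api": "/v2/play"})
	fmt.Println("before:", v1)
	fmt.Println("after: ", v2)
}
```

A content-derived version also makes reconcile idempotent: recomputing an unchanged snapshot produces the same version, so there is nothing to push.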

The xDS request/response protocol (state-of-the-world)

Envoy → DiscoveryRequest { type_url, version_info, resource_names, response_nonce, [error_detail] }
CP    → DiscoveryResponse { type_url, version_info, resources[], nonce }
Envoy → DiscoveryRequest (ACK: echoes version_info == applied; same nonce)
         ...or NACK: version_info == LAST GOOD, error_detail set

The ACK/NACK + nonce dance is the heart of xDS correctness: the control plane knows exactly which version each Envoy is running and whether a push was rejected. This is a versioned, acknowledged replication protocol, recognizably the same shape as the log-matching/commit machinery from db-17.
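The exchange above can be modeled in a few lines. This is a simplified sketch of the proxy side for a single resource type; the struct and function names are hypothetical, and the validation rule (no empty or duplicate cluster names) is an invented stand-in for Envoy's real config validation.

```go
package main

import (
	"errors"
	"fmt"
)

// response and request mirror only the fields discussed above.
type response struct {
	version, nonce string
	clusters       []string
}

type request struct {
	version, nonce, errorDetail string
}

// proxy holds the last-known-good state for one resource type.
type proxy struct {
	applied  string
	clusters []string
}

// validate is a stand-in for real config validation.
func validate(r response) error {
	seen := map[string]bool{}
	for _, c := range r.clusters {
		if c == "" {
			return errors.New("cluster with empty name")
		}
		if seen[c] {
			return fmt.Errorf("duplicate cluster %q", c)
		}
		seen[c] = true
	}
	return nil
}

// handle applies a valid response and ACKs with its version. On an invalid one
// it keeps the last good config and NACKs with the old version plus an error.
// Both replies echo the response nonce.
func (p *proxy) handle(r response) request {
	if err := validate(r); err != nil {
		return request{version: p.applied, nonce: r.nonce, errorDetail: err.Error()}
	}
	p.applied, p.clusters = r.version, r.clusters
	return request{version: r.version, nonce: r.nonce}
}

// isACK is the control-plane check for a reply to sent.
func isACK(sent response, got request) bool {
	return got.errorDetail == "" && got.nonce == sent.nonce && got.version == sent.version
}

func main() {
	p := &proxy{}
	good := response{version: "v1", nonce: "n1", clusters: []string{"playback"}}
	fmt.Println("v1 ACKed:", isACK(good, p.handle(good)))

	bad := response{version: "v2", nonce: "n2", clusters: []string{"playback", "playback"}}
	reply := p.handle(bad)
	fmt.Println("v2 ACKed:", isACK(bad, reply), "| still serving:", p.applied, "| error:", reply.errorDetail)
}
```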

Why ADS (ordering) matters

If CDS (clusters) and EDS (endpoints) come on separate streams, EDS for cluster X might arrive before CDS defines X; Envoy can't place the endpoints and may drop traffic. ADS puts everything on one ordered stream so the control plane can sequence updates make-before-break: clusters (CDS) before their endpoints (EDS), and both before the listeners (LDS) and routes (RDS) that send traffic to them. This is literally a causal-ordering problem (db-16's logical clocks intuition).
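As a small illustration of sequencing on one stream, the sketch below sorts a batch of pending updates into the make-before-break order that Envoy's xDS documentation recommends (CDS, then EDS, then LDS, then RDS). The update type and function names are hypothetical, and a real control plane would also wait for each ACK before sending the next type.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// pushOrder ranks resource types so dependencies are sent before dependents.
var pushOrder = map[string]int{"CDS": 0, "EDS": 1, "LDS": 2, "RDS": 3}

type update struct {
	kind, name string
}

// rank places unknown kinds after all known ones.
func rank(kind string) int {
	if r, ok := pushOrder[kind]; ok {
		return r
	}
	return len(pushOrder)
}

// sequence returns a copy of pending in push order. The sort is stable, so
// updates of the same kind keep their original relative order.
func sequence(pending []update) []update {
	out := append([]update(nil), pending...)
	sort.SliceStable(out, func(i, j int) bool {
		return rank(out[i].kind) < rank(out[j].kind)
	})
	return out
}

// kinds renders the order as a comma-separated string for display.
func kinds(us []update) string {
	ks := make([]string, len(us))
	for i, u := range us {
		ks[i] = u.kind
	}
	return strings.Join(ks, ",")
}

func main() {
	pending := []update{
		{"RDS", "api-routes"},
		{"EDS", "playback"},
		{"LDS", "ingress-8443"},
		{"CDS", "playback"},
	}
	fmt.Println(kinds(sequence(pending))) // CDS,EDS,LDS,RDS
}
```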

Delta vs State-of-the-World

SotW: every update re-sends the full resource set for a type. Simple; expensive when the set is large and changes are small.
Delta xDS: sends only added, changed, or removed resources. Essential at scale (imagine EDS for 10,000 endpoints changing one at a time). The Netflix "on-demand cluster discovery" work is in this spirit: don't push everything to everyone; discover and push what's actually needed.
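The difference between the two modes comes down to a diff between the old and new resource sets. A minimal sketch, with resources modeled as a hypothetical name-to-config map rather than real xDS protos:

```go
package main

import (
	"fmt"
	"sort"
)

// delta compares two resource sets (name -> serialized config) and returns
// what an incremental update would carry: resources that are new or changed,
// and the names of resources that disappeared.
func delta(old, next map[string]string) (upserts map[string]string, removed []string) {
	upserts = map[string]string{}
	for name, cfg := range next {
		if prev, ok := old[name]; !ok || prev != cfg {
			upserts[name] = cfg
		}
	}
	for name := range old {
		if _, ok := next[name]; !ok {
			removed = append(removed, name)
		}
	}
	sort.Strings(removed) // deterministic output
	return upserts, removed
}

func main() {
	old := map[string]string{"10.0.0.1:80": "healthy", "10.0.0.2:80": "healthy", "10.0.0.3:80": "healthy"}
	next := map[string]string{"10.0.0.1:80": "healthy", "10.0.0.2:80": "draining", "10.0.0.4:80": "healthy"}

	upserts, removed := delta(old, next)
	fmt.Println("state-of-the-world would send", len(next), "resources")
	fmt.Println("delta sends", len(upserts), "upserts and", len(removed), "removal:", removed)
}
```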

Reading an Envoy config (you'll be asked)

static_resources:
  listeners:
  - address: { socket_address: { address: 0.0.0.0, port_value: 8443 } }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          route_config:
            virtual_hosts:
            - domains: ["api.local"]
              routes:
              - match: { prefix: "/v1/play" }
                route: { cluster: playback }
          http_filters:
          - name: envoy.filters.http.router
  clusters:
  - name: playback
    type: EDS              # endpoints come from the control plane
    lb_policy: LEAST_REQUEST   # P2C (gw-06)
    outlier_detection: {...}    # gw-06
    circuit_breakers: {...}     # gw-06

Everything you built by hand in gw-03/gw-04/gw-06 has a declarative Envoy equivalent — and the control plane fills it in dynamically.

4. Core terminology

Term	Definition
Data plane	The proxies on the request path (Envoy); p99/throughput-optimized.
Control plane	The source of truth that computes + pushes config; correctness-optimized.
Envoy	The canonical C++ L4/L7 proxy; listeners→filter chains→clusters→endpoints.
xDS	The discovery protocol Envoy uses to fetch config (LDS/RDS/CDS/EDS/SDS).
LDS/RDS/CDS/EDS/SDS	Listener / Route / Cluster / Endpoint / Secret discovery services.
ADS	Aggregated Discovery Service: all xDS on one stream for ordering.
Delta xDS	Incremental updates (only changed resources).
Snapshot / version	A consistent set of resources at a version; pushed atomically.
ACK / NACK	Envoy confirming (or rejecting, with error) an applied config version.
node ID	Identifies an Envoy instance so the control plane sends it the right config.
go-control-plane	The Go library for building an xDS server.
On-demand discovery	Fetch/push only the resources actually needed (Netflix mesh).

5. Mental models

The control plane is the brain; the data plane is the muscle. The brain decides what should be true (routes, clusters, endpoints) and sends instructions; the muscle executes on the request path, fast and dumb. They fail differently: a brain outage means "config stops changing" (the muscle keeps running on last-known-good — a key resilience property); a muscle outage means "requests drop." Optimize each for its failure mode.
xDS is git pull for proxy config, with acknowledgments. Each Envoy subscribes and pulls the latest version; it reports back exactly which commit (version) it's on, and rejects (NACK) a bad commit without losing the last good one. The control plane always knows the fleet's deployed version distribution.
ADS ordering is "define the word before you use it." Sending endpoints for a cluster Envoy hasn't been told about is referencing an undefined variable. ADS guarantees declaration-before-use across resource types over one ordered stream.
Last-known-good is the data plane's seatbelt. If the control plane vanishes, Envoy keeps serving on the config it last applied. So a control-plane outage degrades change velocity, not availability — a deliberate, vital design choice you should call out.

6. Common misconceptions

"The control plane is on the request path." It must not be. If every request consulted the control plane, its availability would cap the data plane's. Config is pushed ahead of time and cached; the request path never waits on it. (Contrast with the push-registry lookup in gw-05, which is on the delivery path and is therefore a low-latency datastore, not a control plane.)
"xDS is just config files over the network." The ACK/NACK + versioning + ordering (ADS) make it a consistency protocol. Treating it as dumb file sync leads to the classic bugs: endpoints before clusters, partial fleet rollout, no visibility into who applied what.
"Push the new config to everyone at once." A simultaneous fleet-wide push is a thundering herd and a blast-radius maximizer. Stage it: validate, canary a few Envoys, watch their ACKs and metrics, then ramp (gw-12). xDS gives you the per-node ACK signal to do this safely.
"Envoy and the control plane are one product." They're deliberately separate; many control planes (Istio, Contour, Gloo, your own) drive the same Envoy. The interface is xDS. That separation is why you can build a custom control plane (this lab) without touching the data plane.
"NACKs are rare edge cases." A NACK means an Envoy rejected your config — it's running stale while you think you shipped. Unmonitored NACKs are a silent "my change didn't take" failure. Alert on them.

7. Interview talking points

"Design the control plane for a gateway fleet." Source of truth → reconcile to desired state → versioned snapshot per node → xDS push with ACK/NACK → last-known-good on control-plane outage → staged/canaried rollout. Emphasize it's off the request path and that config propagation is a consistency problem (cite ADS ordering).
"Why is the data/control-plane split the right architecture?" Different SLOs and failure modes: data plane optimized for p99 and availability (keeps serving on last-known-good); control plane optimized for correctness and safe propagation. You can scale, deploy, and reason about them independently.
"Walk me through xDS." Envoy subscribes per resource type (LDS/RDS/CDS/EDS/SDS) on a gRPC stream; control plane streams versioned resources; Envoy ACKs the applied version or NACKs with an error keeping last-known-good; nonce correlates responses; ADS aggregates for ordering; delta xDS for scale. It's a versioned, acknowledged replication protocol.
"Why does ADS exist?" Ordering across resource types: clusters before their endpoints, and clusters and endpoints before the listeners and routes that send traffic to them. Separate streams race and cause transient drops. It's a causal-ordering guarantee, in the same family as the logical clocks in db-16.
"What happens to traffic if the control plane goes down?" Nothing immediately — Envoy serves on last-known-good config. You lose the ability to change config (new endpoints, routes, certs). So you make the control plane highly available for change velocity, but its outage isn't a data-plane outage. Design certs/SDS so they don't expire during a plausible control-plane outage.
"How would you safely roll out a config change to 80 clusters?" Validate → push to a canary set of Envoys (one node group) → watch ACKs (not NACKs) + RED metrics → ramp by node groups with automated rollback on SLO/NACK breach. The per-node ACK is your rollout signal.

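The rollout talking point can be made concrete with a small gate function. The sketch below is an illustration only: `NodeState`, `Decision`, and `Gate` are hypothetical names, not part of the lab's packages, and a real gate would also consult RED/SLO metrics, which are left out here.

```go
package main

import "fmt"

// Decision is what a rollout controller does next for a canary group.
type Decision string

const (
	Wait     Decision = "wait"     // some nodes have not ACKed the target yet
	Proceed  Decision = "proceed"  // every canary node ACKed the target
	Rollback Decision = "rollback" // at least one node NACKed
)

// NodeState is the per-node view a control plane tracks (hypothetical shape).
type NodeState struct {
	ID       string
	Applied  string // last ACKed version
	LastNack string // empty if the last response was an ACK
}

// Gate decides from ACK/NACK state alone; SLO checks are out of scope here.
func Gate(target string, canary []NodeState) Decision {
	if len(canary) == 0 {
		return Wait // nothing to observe yet
	}
	acked := 0
	for _, n := range canary {
		if n.LastNack != "" {
			return Rollback
		}
		if n.Applied == target {
			acked++
		}
	}
	if acked == len(canary) {
		return Proceed
	}
	return Wait
}

func main() {
	canary := []NodeState{
		{ID: "edge-1", Applied: "v2"},
		{ID: "edge-2", Applied: "v1"}, // still on the old version
	}
	fmt.Println(Gate("v2", canary))
	canary[1].Applied = "v2"
	fmt.Println(Gate("v2", canary))
	canary[0].LastNack = "bad route"
	fmt.Println(Gate("v2", canary))
}
```

Treating any NACK as an immediate rollback is a deliberately conservative choice for the sketch; a production gate might tolerate a small NACK budget per node group.
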
8. Connections to other labs

gw-03 (API gateway) — Envoy's HCM + filter chain is the production twin of your hand-built gateway; routes are RDS.
gw-04 (connection management) — EDS supplies the endpoint membership your subsetting computes over; Envoy has built-in subset LB and per-worker pools.
gw-06 (resilience) — P2C (LEAST_REQUEST), outlier detection, circuit breakers, adaptive concurrency are all Envoy config pushed via xDS.
gw-07 (security) — SDS pushes certs/keys for hot rotation.
gw-09 (Kubernetes) — EndpointSlices are the EDS source; the control plane watches the K8s API.
gw-10 (Gateway API/operator) — the operator is a control plane: it reconciles Gateway/HTTPRoute CRDs into xDS (or vendor config).
db-16…20 (consensus) — config propagation ordering, versioning, and ACK are the same consistency problems you solved in the consensus phase.

gw-08 — The Hitchhiker's Guide to Envoy & the xDS Control Plane

Companion to CONCEPTS.md, with the runnable mini-xDS in src/go/xds/. The JD splits the work into "data plane and control plane" — this lab is that split, in code.

Envoy by itself proxies whatever its static config says; its power is being dynamically configured by a control plane over xDS. The control plane is the source of truth (which listeners/routes/clusters/endpoints exist); the data plane is a fleet of Envoys that fetch and apply that config in real time. This lab models the protocol mechanics (versioned snapshots, ordering, ACK/NACK, last-known-good, and a reconcile loop) in stdlib, so they're legible. Production is envoyproxy/go-control-plane driving real Envoy; the concepts are identical.

Run bash scripts/verify.sh and watch the whole control loop:

reconcile #1 (initial):                  [envoy] applied v=5bfda895b0aa -> ACK
reconcile #2 (no change -> debounced):   [cp] desired state unchanged; nothing pushed
reconcile #3 (scale up -> new version):  [envoy] applied v=6c41278c9e89 -> ACK
reconcile #4 (inconsistent -> rejected): [cp] rejected: route "r" references undefined cluster "ghost"
final: applied=6c41278c9e89  current=6c41278c9e89  lastNACK=""

1. Resources and the snapshot (resource.go)

The four xDS resource families map to Envoy's model:

Type	xDS	meaning
`Listener`	LDS	a port + the route config it uses
`Route`	RDS	match (prefix) → cluster (the gw-03 route table)
`Cluster`	CDS	a logical upstream service
`Endpoints`	EDS	the concrete instances of a cluster (the gw-04 membership!)

A Snapshot bundles them at a Version and is pushed atomically — Envoy never sees a half-applied set.

Consistency = why ADS exists

Snapshot.Consistent() enforces the cross-resource invariants:

every Route references a defined Cluster,
every Listener references a defined Route,
every Endpoints references a defined Cluster.

TestSnapshotConsistency proves it rejects a route to a ghost cluster and a listener to a missing route. This is exactly why ADS (Aggregated Discovery Service) exists: if clusters and endpoints arrive on separate streams, EDS for cluster X can land before CDS defines X, and Envoy drops the endpoints. ADS puts everything on one ordered stream so the control plane can guarantee declaration-before-use — a causal-ordering problem, the same family as db-16's logical clocks.

Content-hash versioning

Snapshot.Fingerprint() hashes the resources order-independently (TestFingerprintStability: same resources in any add order → same hash; different resources → different hash). Versioning by content hash means identical desired state ⇒ identical version ⇒ no spurious push (debounce), and reconciles are idempotent across control-plane restarts. This is the same determinism discipline as the byte-identical dumps in the consensus phase (db-16…20).
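
A minimal sketch of an order-independent content hash follows. It is not the lab's Fingerprint(): the `Resource` type is hypothetical, resource names are assumed unique, and the 12-hex-character truncation simply mirrors the version strings in the demo output.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// Resource is a hypothetical flattened resource: a unique name plus its
// serialized body.
type Resource struct {
	Name string
	Body string
}

// Fingerprint sorts a copy by name before hashing, so the order in which
// resources were added does not affect the result.
func Fingerprint(rs []Resource) string {
	sorted := append([]Resource(nil), rs...) // copy: leave the caller's slice alone
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })
	h := sha256.New()
	for _, r := range sorted {
		// Length prefixes keep ("ab","c") and ("a","bc") from colliding.
		fmt.Fprintf(h, "%d:%s|%d:%s|", len(r.Name), r.Name, len(r.Body), r.Body)
	}
	return hex.EncodeToString(h.Sum(nil))[:12]
}

func main() {
	a := []Resource{{"cluster/playback", "eds"}, {"route/r", "playback"}}
	b := []Resource{{"route/r", "playback"}, {"cluster/playback", "eds"}}
	fmt.Println(Fingerprint(a) == Fingerprint(b)) // same set, different add order
}
```
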

2. The cache and the ACK/NACK state machine (cache.go)

SnapshotCache holds the latest snapshot per node and tracks, per node, the version each Envoy ACKed (or its last NACK). That tracking is the rollout signal: the control plane always knows the fleet's deployed version distribution.

TestAckNackFlow walks the protocol:

SetSnapshot(node, v1) validates + stores + pushes; Envoy receives it and Ack(node, v1) → AppliedVersion == v1.
SetSnapshot(node, v2) pushes v2, but Envoy Nack(node, v2, reason) (it failed local validation). The applied version stays at v1 — Envoy keeps serving last-known-good — and the NACK reason is recorded.
A later Ack clears the NACK.

Two maintainer-level truths fall out:

A control-plane outage is not a data-plane outage. Envoy serves on last-known-good config; you lose the ability to change config, not the ability to serve. TestSetSnapshotRejectsInconsistent shows a rejected push leaves the previous snapshot in force. Design certs/SDS so they don't expire during a plausible control-plane outage.
Unmonitored NACKs are a silent failure. A NACK means "Envoy rejected your change and is running stale while you think you shipped." LastNack is the metric you alert on (gw-11).
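
The ACK/NACK bookkeeping described in this section fits in a small mutex-guarded map. The sketch below is an assumption-laden illustration rather than the lab's cache.go: `Tracker`, `Ack`, `Nack`, and `Status` are names chosen here, and snapshot storage and push are omitted.

```go
package main

import (
	"fmt"
	"sync"
)

type nodeState struct {
	applied  string // last version the node ACKed (last-known-good)
	lastNack string // "version: reason" of the most recent NACK, or ""
}

// Tracker records, per node, what each Envoy actually applied.
type Tracker struct {
	mu    sync.Mutex
	nodes map[string]*nodeState
}

func NewTracker() *Tracker { return &Tracker{nodes: make(map[string]*nodeState)} }

// node returns the state for id, creating it on first use. Caller holds t.mu.
func (t *Tracker) node(id string) *nodeState {
	n, ok := t.nodes[id]
	if !ok {
		n = &nodeState{}
		t.nodes[id] = n
	}
	return n
}

// Ack advances the applied version and clears any earlier NACK.
func (t *Tracker) Ack(id, version string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	n := t.node(id)
	n.applied = version
	n.lastNack = ""
}

// Nack records the rejection but leaves the applied version untouched.
func (t *Tracker) Nack(id, version, reason string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.node(id).lastNack = fmt.Sprintf("%s: %s", version, reason)
}

// Status reports the applied version and the last NACK for a node.
func (t *Tracker) Status(id string) (applied, lastNack string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	n := t.node(id)
	return n.applied, n.lastNack
}

func main() {
	t := NewTracker()
	t.Ack("edge-envoy-1", "v1")
	t.Nack("edge-envoy-1", "v2", "failed local validation")
	applied, nack := t.Status("edge-envoy-1")
	fmt.Printf("applied=%s lastNACK=%q\n", applied, nack)
}
```
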

3. The reconcile loop (reconcile.go)

Reconciler is level-triggered: it derives the desired snapshot from a Source (the K8s API in gw-10, or a service registry), versions it by content hash, and pushes only when it changed and is consistent.

TestReconcileDebounceAndChange: first reconcile pushes; a second with identical desired state pushes nothing (debounce, vital under the EDS/EndpointSlice churn of gw-09); a membership change pushes a new version. TestReconcileKeepsLastGoodOnError: an inconsistent desired state returns an error and pushes nothing, so last-known-good remains. This is the exact same reconcile shape as the Kubernetes operator in gw-10; gw-10's operator is a control plane whose Source is the K8s API.

4. How it stitches the phase together

The Route/Cluster resources are gw-03's route table, pushed dynamically instead of hard-coded.
Endpoints (EDS) is the membership gw-04's subsetting ring consumes; its churn is why the reconcile loop debounces.
Resilience policy (gw-06: LB, outlier detection, circuit breakers) is Envoy cluster config the control plane pushes.
TLS certs (gw-07) are pushed via SDS, the same protocol.
In gw-10, a Kubernetes operator replaces the Source with watches on Gateway-API CRDs and EndpointSlices — and you have Envoy Gateway / Contour in miniature.

5. Hands-on

cd src/go
bash ../scripts/verify.sh        # tests + the control-loop demo

# Drive a real Envoy with go-control-plane (the production path) — see
# steps/01-minimal-xds-server.md for the bootstrap config and the
# envoyproxy/go-control-plane wiring. The mechanics you'd reason about
# (snapshot/version/ACK/NACK/ADS ordering) are exactly what this lab
# implements.

6. Exercises

Per-node canary snapshots: give a subset of node IDs a different snapshot version and verify (via AppliedVersion per node) that you can ramp a config change node-group by node-group (gw-12).
Delta xDS: extend the push to send only changed resources rather than the full set; measure the bytes saved when one of 10,000 endpoints changes.
On-demand discovery: only push clusters a node has actually referenced (the Netflix "zero-config service mesh" idea) instead of the whole world.
Drive real Envoy: wire go-control-plane's SnapshotCache and point an Envoy at it (steps/01); confirm curl localhost:9901/config_dump shows your version and that killing the control plane leaves traffic serving (last-known-good).
Alert on NACKs: expose LastNack/applied-version-skew as metrics (gw-11) and write the burn-rate-style alert for "N nodes failed to apply the latest config."

gw-08 — References

Envoy & xDS (the canon)

Envoy docs — architecture overview (listeners, filter chains, clusters, endpoints), the HTTP connection manager, and the xDS configuration model. The best free data-plane explanation anywhere. https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/intro/arch_overview
xDS protocol spec — the DiscoveryRequest/Response, ACK/NACK, nonce, ADS, and Delta xDS semantics. https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol
xDS as a universal data-plane API — the cross-project effort to standardize xDS (gRPC also speaks xDS for proxyless LB). https://github.com/cncf/xds

Build a control plane

envoyproxy/go-control-plane — the Go xDS server library used in the steps; read the cache.SnapshotCache and the example server. https://github.com/envoyproxy/go-control-plane
Envoy "Dynamic configuration (control plane)" sandbox — a runnable example wiring Envoy to a go-control-plane server. https://www.envoyproxy.io/docs/envoy/latest/start/sandboxes/dynamic-configuration-control-plane

Production control planes to study

Istio istiod — the most-deployed Envoy control plane; read how it translates K8s + its CRDs into xDS.
Contour / Envoy Gateway — Gateway-API → Envoy control planes (gw-10). https://github.com/envoyproxy/gateway
Netflix — Zero Configuration Service Mesh with On-Demand Cluster Discovery (the lazy/on-demand xDS direction). https://netflixtechblog.com/zero-configuration-service-mesh-with-on-demand-cluster-discovery-ac6483b52a51

Tooling

func-e / getenvoy — run Envoy locally without a container.
envoy --config-yaml ... --log-level debug — watch xDS subscribe/ACK.
curl localhost:9901/config_dump — Envoy admin: see the applied config (and /clusters, /stats, /server_info).

Cross-lab dependencies

Upstream: gw-03 (the gateway model Envoy implements), gw-04 (EDS membership), gw-06 (resilience config), gw-07 (SDS), db-16…20 (consistency intuition for config propagation).
Downstream: gw-09 (EndpointSlices as the EDS source), gw-10 (the operator as a control plane), gw-12 (staged config rollout).

gw-08 — Analysis

The design review for the xDS control plane you build in steps/.

Required behaviors

Off the request path. The control plane computes and pushes config; Envoy serves from cached config. A control-plane outage must not drop traffic (last-known-good).
Versioned, atomic snapshots. Each push is a consistent set at a version; Envoy never sees a half-applied snapshot.
Ordering. Within a snapshot, clusters are consistent with the endpoints and routes that reference them; ADS lets the control plane deliver them in declaration-before-use order.
ACK/NACK visibility. The control plane records which version each node applied and surfaces NACKs (rejected config) as alerts.
Per-node config. Different Envoys (by node ID) can get different snapshots (canary vs baseline) for staged rollout.

Design decisions

SnapshotCache keyed by node ID. go-control-plane's cache holds one versioned snapshot per node key; with cache.IDHash that key is the Envoy node ID, and a custom NodeHash can map many nodes onto one group. Setting a new snapshot triggers the push; the library handles the stream, ACK/NACK, and nonce. The lab focuses on computing good snapshots.
ADS (single aggregated stream). The lab uses ADS so cluster/endpoint/route ordering is correct by construction, demonstrating why it matters by first (optionally) showing the SotW race without it.
A tiny reconcile loop. A goroutine watches a mock "source of truth" (a file or an in-memory registry that you mutate) and rebuilds the snapshot on change, bumping the version. This previews the Kubernetes controller in gw-10 (same shape: watch → reconcile → desired state).
Consistent versioning. The version string is derived from a hash of the desired state, so identical desired state ⇒ identical version ⇒ no spurious pushes (debouncing). This mirrors the determinism discipline from the consensus phase.

Tradeoffs worth flagging

SotW simplicity vs Delta scale. State-of-the-world re-sends the full set per type; fine for small fleets/resource sets, costly when EDS has thousands of endpoints churning. Delta xDS is the scale answer; the lab notes where it would plug in. Netflix's on-demand discovery is the "don't push everything to everyone" extreme.
Push-all vs staged rollout. Bumping every node's snapshot at once maximizes blast radius. Per-node snapshots let you canary, at the cost of managing more snapshot state and a rollout controller (gw-12).
Reconcile debounce vs freshness. Rebuilding on every source change can thrash under rapid churn (pod flapping in gw-09); debouncing adds latency to legitimate changes. Tune the debounce window; it's the same trade-off as re-subsetting in gw-04.
Control-plane HA. The control plane should be replicated, but because it's off the request path its availability requirement can be looser than the data plane's. A short control-plane outage is tolerable; a data-plane outage is not.

What production adds beyond this lab

Delta xDS and on-demand resource discovery (push only what a node needs) for fleet/endpoint scale.
A real source of truth: the Kubernetes API (gw-10) or a service registry, with watches and resync.
SDS for certs (gw-07) and runtime/RTDS for feature flags.
Rollout safety: per-node-group canaries, automated rollback on NACK or SLO breach, and a config-validation gate before any push (gw-12).
Deep observability: per-node applied version, NACK rate + reasons, push latency, snapshot size, ADS stream health (gw-11).

gw-08 — Execution

Prerequisites

Go ≥ 1.25 (stdlib only, offline). Optional for the real-Envoy path: func-e/getenvoy + go-control-plane (see steps/01).

One-shot

cd gw-08-envoy-xds-control-plane && bash scripts/verify.sh   # tests + control-loop demo

Per-language workflow (Go)

cd gw-08-envoy-xds-control-plane/src/go
go test -race -count=1 ./...        # snapshot consistency, fingerprint, ACK/NACK, reconcile
go run ./cmd/xdsdemo                # reconcile + ACK + debounce + reject-keeps-last-good

Package map

File	What
`xds/resource.go`	LDS/RDS/CDS/EDS resources, snapshot, consistency (ADS ordering), content-hash version
`xds/cache.go`	snapshot cache per node, subscribe/push, ACK/NACK + last-known-good tracking
`xds/reconcile.go`	level-triggered reconcile loop with debounce + keep-last-good-on-error
`cmd/xdsdemo`	the control-loop demonstration

Driving real Envoy

steps/01-minimal-xds-server.md shows the go-control-plane wiring and the Envoy bootstrap to point a real proxy at an xDS server. The protocol mechanics there (snapshot/version/ACK/NACK/ADS) are exactly what this lab's xds package implements.

See GUIDE.md for the deep dive.

gw-08 — Verification

One command

cd gw-08-envoy-xds-control-plane && bash scripts/verify.sh

What the tests prove

Test	Invariant
`TestSnapshotConsistency`	routes→clusters and listeners→routes must resolve (the ADS-ordering invariant); a good snapshot is consistent
`TestFingerprintStability`	content hash is order-independent; identical resources hash identically, different ones differ
`TestSubscribePushesCurrent`	subscribing receives the current snapshot immediately
`TestAckNackFlow`	ACK advances the applied version; NACK keeps last-known-good and records the reason; a later ACK clears it
`TestSetSnapshotRejectsInconsistent`	an inconsistent push is rejected and the previous snapshot stays in force
`TestReconcileDebounceAndChange`	a no-op reconcile pushes nothing; a membership change pushes a new version
`TestReconcileKeepsLastGoodOnError`	an inconsistent desired state errors without pushing; last-known-good remains

All under -race.

Demo (xdsdemo, in verify.sh)

Shows: initial push + ACK, a debounced no-op, a scale-up new version, and an inconsistent push rejected with last-known-good retained.

What "green" does NOT guarantee

Not real xDS-over-gRPC. This models the protocol mechanics in stdlib; driving an actual Envoy uses go-control-plane (steps/01).
State-of-the-world only. Delta xDS and on-demand discovery are exercises (GUIDE §6.2/§6.3).
No per-node canary in the cache demo. Staged rollout by node group is an exercise (GUIDE §6.1) and the basis of gw-12.

gw-08 step 01 — A minimal xDS control plane driving a real Envoy

Goal

Build an xDS server with go-control-plane that serves a listener, route, cluster, and endpoints to a real Envoy, and watch config flow from your Go process into the live proxy.

Code — `src/go/snapshot.go`

package cp

import (
	cluster "github.com/envoyproxy/go-control-plane/envoy/config/cluster/v3"
	core "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	endpoint "github.com/envoyproxy/go-control-plane/envoy/config/endpoint/v3"
	listener "github.com/envoyproxy/go-control-plane/envoy/config/listener/v3"
	route "github.com/envoyproxy/go-control-plane/envoy/config/route/v3"
	"github.com/envoyproxy/go-control-plane/pkg/cache/types"
	"github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	"github.com/envoyproxy/go-control-plane/pkg/resource/v3"
)

// MakeSnapshot builds a consistent set of {cluster, endpoints, route,
// listener} at a version. Everything references "playback".
func MakeSnapshot(version string, endpoints []string) *cache.Snapshot {
	const clusterName = "playback"

	cl := &cluster.Cluster{
		Name:                 clusterName,
		ClusterDiscoveryType: &cluster.Cluster_Type{Type: cluster.Cluster_EDS},
		EdsClusterConfig: &cluster.Cluster_EdsClusterConfig{
			EdsConfig: &core.ConfigSource{ // endpoints come via EDS (ADS)
				ConfigSourceSpecifier: &core.ConfigSource_Ads{Ads: &core.AggregatedConfigSource{}},
			},
		},
		LbPolicy: cluster.Cluster_LEAST_REQUEST, // P2C (gw-06)
	}

	eds := makeEndpoints(clusterName, endpoints) // *endpoint.ClusterLoadAssignment
	rt := makeRoute("local_route", clusterName)  // *route.RouteConfiguration (helper not shown)
	ls := makeHTTPListener("listener_0", "local_route", 10000) // *listener.Listener (helper not shown)

	snap, err := cache.NewSnapshot(version, map[resource.Type][]types.Resource{
		resource.ClusterType:  {cl},
		resource.EndpointType: {eds},
		resource.RouteType:    {rt},
		resource.ListenerType: {ls},
	})
	if err != nil {
		panic(err) // surface construction errors instead of returning a nil snapshot
	}
	return snap
}
}

func makeEndpoints(clusterName string, addrs []string) *endpoint.ClusterLoadAssignment {
	var lbs []*endpoint.LbEndpoint
	for _, a := range addrs {
		host, port := splitHostPort(a)
		lbs = append(lbs, &endpoint.LbEndpoint{
			HostIdentifier: &endpoint.LbEndpoint_Endpoint{Endpoint: &endpoint.Endpoint{
				Address: &core.Address{Address: &core.Address_SocketAddress{
					SocketAddress: &core.SocketAddress{
						Address:       host,
						PortSpecifier: &core.SocketAddress_PortValue{PortValue: port},
					}}},
			}},
		})
	}
	return &endpoint.ClusterLoadAssignment{
		ClusterName: clusterName,
		Endpoints:   []*endpoint.LocalityLbEndpoints{{LbEndpoints: lbs}},
	}
}

Code — `src/go/server.go` (the xDS gRPC server)

package cp

import (
	"context"
	"net"

	clusterservice "github.com/envoyproxy/go-control-plane/envoy/service/cluster/v3"
	discovery "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
	endpointservice "github.com/envoyproxy/go-control-plane/envoy/service/endpoint/v3"
	listenerservice "github.com/envoyproxy/go-control-plane/envoy/service/listener/v3"
	routeservice "github.com/envoyproxy/go-control-plane/envoy/service/route/v3"
	"github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	"github.com/envoyproxy/go-control-plane/pkg/server/v3"
	"google.golang.org/grpc"
)

func Run(ctx context.Context, snapCache cache.SnapshotCache, addr string) error {
	srv := server.NewServer(ctx, snapCache, nil)
	grpcServer := grpc.NewServer()

	// Register the ADS endpoint (aggregates all xDS on one stream).
	discovery.RegisterAggregatedDiscoveryServiceServer(grpcServer, srv)
	// (Per-type services can also be registered for non-ADS Envoys.)
	endpointservice.RegisterEndpointDiscoveryServiceServer(grpcServer, srv)
	clusterservice.RegisterClusterDiscoveryServiceServer(grpcServer, srv)
	routeservice.RegisterRouteDiscoveryServiceServer(grpcServer, srv)
	listenerservice.RegisterListenerDiscoveryServiceServer(grpcServer, srv)

	lis, err := net.Listen("tcp", addr)
	if err != nil {
		return err
	}
	// Stop the gRPC server when ctx is cancelled so Run returns instead of
	// blocking forever; Stop closes the listener and any open streams.
	go func() {
		<-ctx.Done()
		grpcServer.Stop()
	}()
	return grpcServer.Serve(lis)
}

Code — `main.go`

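func main() {
	snapCache := cache.NewSnapshotCache(true, cache.IDHash{}, nil) // ADS=true
	const nodeID = "edge-envoy-1"

	snap := cp.MakeSnapshot("v1", []string{"127.0.0.1:9001", "127.0.0.1:9002"})
	if err := snap.Consistent(); err != nil { // verify clusters/endpoints/routes line up
		log.Fatalf("inconsistent snapshot: %v", err)
	}
	if err := snapCache.SetSnapshot(context.Background(), nodeID, snap); err != nil {
		log.Fatalf("set snapshot: %v", err)
	}

	// Run blocks while serving; report the error if it ever returns.
	log.Fatal(cp.Run(context.Background(), snapCache, ":18000"))
}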

Envoy bootstrap (points Envoy at your control plane)

envoy-bootstrap.yaml:

node: { id: edge-envoy-1, cluster: edge }
dynamic_resources:
  ads_config:
    api_type: GRPC
    transport_api_version: V3
    grpc_services: [{ envoy_grpc: { cluster_name: xds_cluster } }]
  cds_config: { ads: {}, resource_api_version: V3 }
  lds_config: { ads: {}, resource_api_version: V3 }
static_resources:
  clusters:
  - name: xds_cluster
    type: STRICT_DNS
    typed_extension_protocol_options:   # h2 to the control plane
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config: { http2_protocol_options: {} }
    load_assignment:
      cluster_name: xds_cluster
      endpoints: [{ lb_endpoints: [{ endpoint: { address: { socket_address: { address: 127.0.0.1, port_value: 18000 }}}}]}]
admin: { address: { socket_address: { address: 127.0.0.1, port_value: 9901 }}}

# two mock origins (start these first so the endpoints are reachable):
python3 -m http.server 9001 & python3 -m http.server 9002 &
go run . &                                 # control plane on :18000
func-e run -c envoy-bootstrap.yaml &       # Envoy fetches config via ADS

# Traffic flows through Envoy (listener on :10000) to the origins:
curl -v localhost:10000/

# See what Envoy ACTUALLY applied (and the version it ACKed):
curl -s localhost:9901/config_dump | jq '.. | .version_info? // empty' | sort -u
curl -s localhost:9901/clusters     # endpoints Envoy got via EDS

Tasks

Build the control plane, boot Envoy against it, and curl through Envoy to the origins — config came entirely from your Go process.
In Envoy's /config_dump, find the version_info and confirm it matches the "v1" you set (Envoy ACKed it).
Break the snapshot on purpose (route references a nonexistent cluster); confirm snap.Consistent() catches it, or that Envoy NACKs and /config_dump stays on the last good version.

Acceptance

A request through Envoy reaches an origin using only dynamically pushed config (no static routes/clusters).
You can read the applied version from Envoy's admin and tie it to your snapshot version.
A bad snapshot is rejected (consistency check or NACK), and Envoy keeps serving last-known-good.

Discussion prompts

Where is the control plane relative to the request path, and what happens to live traffic if you kill your Go process? (Last-known-good.)
Why does the snapshot bundle clusters + endpoints + routes + listeners at one version? What goes wrong if endpoints update without the cluster?
What does a NACK mean operationally, and why must you alert on it?

gw-08 step 02 — A reconcile loop and live endpoint updates (EDS)

Goal

Turn the static snapshot into a living control plane: a reconcile loop watches a source of truth, recomputes the desired snapshot on change, and pushes new endpoints to Envoy with no dropped requests — the operational pattern behind every gateway fleet (and the Kubernetes operator in gw-10).

Code — the reconcile loop

package cp

import (
	"context"
	"crypto/sha256"
	"encoding/hex"
	"sort"
	"time"

	"github.com/envoyproxy/go-control-plane/pkg/cache/v3"
)

// Source is the desired state (in prod: the K8s API or a service
// registry; here: a value you mutate).
type Source interface {
	Endpoints() []string // current healthy endpoints for "playback"
}

// Reconciler watches the source and pushes a new snapshot when desired
// state changes. Version = hash(desired state), so identical state never
// triggers a spurious push (debounce by content).
type Reconciler struct {
	Cache  cache.SnapshotCache
	NodeID string
	Source Source

	lastVersion string
}

func (r *Reconciler) Run(ctx context.Context) {
	t := time.NewTicker(500 * time.Millisecond) // or a real watch/event
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			r.reconcileOnce(ctx)
		}
	}
}

func (r *Reconciler) reconcileOnce(ctx context.Context) {
	eps := r.Source.Endpoints()
	sort.Strings(eps) // deterministic -> stable version for same set
	version := hashVersion(eps)
	if version == r.lastVersion {
		return // no change: no push (debounce)
	}
	snap := MakeSnapshot(version, eps)
	if err := snap.Consistent(); err != nil {
		// Never push an inconsistent snapshot — keep last-known-good.
		return
	}
	if err := r.Cache.SetSnapshot(ctx, r.NodeID, snap); err != nil {
		return
	}
	r.lastVersion = version
}

func hashVersion(eps []string) string {
	h := sha256.New()
	for _, e := range eps {
		h.Write([]byte(e))
		h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil))[:12]
}

The experiment — add/remove an endpoint live

type mutableSource struct{ mu sync.Mutex; eps []string }
func (s *mutableSource) Endpoints() []string { s.mu.Lock(); defer s.mu.Unlock(); return append([]string{}, s.eps...) }
func (s *mutableSource) set(eps []string)    { s.mu.Lock(); s.eps = eps; s.mu.Unlock() }

func main() {
	src := &mutableSource{eps: []string{"127.0.0.1:9001"}}
	snapCache := cache.NewSnapshotCache(true, cache.IDHash{}, nil)
	r := &cp.Reconciler{Cache: snapCache, NodeID: "edge-envoy-1", Source: src}
	go r.Run(context.Background())

	// Later: scale up -> a new endpoint appears in the source -> EDS push.
	go func() {
		time.Sleep(10 * time.Second)
		src.set([]string{"127.0.0.1:9001", "127.0.0.1:9002"}) // reconcile pushes v2
	}()

	// Serve xDS in the foreground: returning from main right after the
	// mutation would kill the control plane before Envoy sees the update.
	log.Fatal(cp.Run(context.Background(), snapCache, ":18000"))
}

Drive continuous load through Envoy while you add/remove endpoints:

wrk -t2 -c50 -d60s http://127.0.0.1:10000/ &
# during the run, mutate the source (add :9002, then remove :9001)
watch -n1 'curl -s localhost:9901/clusters | grep playback'  # endpoints change live

Tasks

Implement the reconcile loop with content-hash versioning. Add an endpoint to the source and confirm Envoy's /clusters shows it within a reconcile tick — with zero errors in the wrk run.
Remove an endpoint; confirm Envoy stops sending it traffic gracefully (existing requests finish; new ones avoid it).
Show the debounce: mutate the source to the same set; confirm no new version is pushed (the hash is unchanged).
Tie it to gw-04: this EDS membership is exactly what your subsetting ring consumes. Rapidly flap an endpoint and discuss debouncing the re-subset so the fix doesn't cause churn.

Acceptance

Live endpoint add/remove reflected in Envoy with no dropped requests.
Content-hash versioning prevents spurious pushes (no-op reconciles don't bump the version).
A correct statement of how this feeds gw-04 subsetting and gw-10's operator.

Discussion prompts

Why hash the desired state for the version instead of a counter? (Same state ⇒ same version ⇒ idempotent reconcile; survives control-plane restarts without spurious pushes.)
Rapid pod churn (gw-09) makes the source flap. How do you debounce so Envoy isn't pushed thousands of EDS updates/sec, without making legitimate scale-ups too slow?
This loop is exactly a Kubernetes controller's reconcile (gw-10): watch desired state → compute → converge. What does controller-runtime add (work queue, rate limiting, leader election) and why?

gw-09 — Kubernetes Networking Internals: CNI, kube-proxy, Services

The JD requires a "solid understanding of Kubernetes internals (Networking, CNI, CRDs, and Operator patterns)." This lab covers the networking half; gw-10 covers CRDs/operators. The gateway fleet runs on Kubernetes (Netflix's "Managing Netflix's Compute with Kubernetes" talk is the Titus→K8s story), so a Cloud Gateway engineer must understand how a packet actually reaches a pod, how Services and EndpointSlices work (they're the membership source for gw-04/gw-08), and how readiness gates the graceful drain you built in gw-01/gw-05.

This is the layer where the abstractions (Service, Pod, Endpoint) meet the kernel (veth pairs, iptables/IPVS/eBPF, conntrack). When a gateway has a mysterious latency or connection problem, the answer is often here — in kube-proxy rules, conntrack table limits, or a readiness race during a deploy.

1. What is it?

Kubernetes networking rests on a few hard rules and a stack of components that implement them. The Kubernetes network model:

Every pod gets its own IP (no NAT between pods).
Pods can reach every other pod directly by IP, across nodes, without NAT.
Agents on a node (kubelet, etc.) can reach pods on that node.

How those rules are realized:

 Service (stable virtual IP + DNS name)              ← the abstraction
    │  selects pods by label → EndpointSlices (the real pod IPs)
    ▼
 kube-proxy (or Cilium/eBPF) programs the node       ← the dataplane glue
    │  iptables / IPVS / eBPF maps ServiceIP → a pod IP
    ▼
 CNI plugin wires each pod's network namespace        ← the pod plumbing
    │  veth pair: pod netns ↔ node bridge/overlay
    ▼
 the packet reaches the pod (your gateway / origin)

CNI (Container Network Interface) is the spec + plugins that give a pod its network: create the pod's network namespace, a veth pair (one end in the pod, one on the node), assign an IP (IPAM), and set up routing/overlay so cross-node traffic works (Calico, Cilium, AWS VPC CNI, flannel...).
Service is a stable virtual IP (ClusterIP) + DNS name fronting a set of pods. EndpointSlices (the scalable successor to Endpoints) list the actual ready pod IPs behind a Service.
kube-proxy programs the node so traffic to a ClusterIP is load-balanced to a backing pod IP — via iptables (rule chains), IPVS (kernel L4 LB, scales better), or it's replaced entirely by eBPF (Cilium) for performance and observability.

2. Why does it matter?

The gateway runs here, and so do its origins. Drain (gw-01, gw-05) depends on readiness probes + EndpointSlice updates + terminationGracePeriodSeconds + preStop. Get the ordering wrong and every deploy drops connections. This is the operational glue for everything in the phase.
EndpointSlices are the membership source. The subsetting in gw-04 and the EDS in gw-08 ultimately read pod IPs from EndpointSlices. Their churn (pods coming/going on every deploy/scale) is what makes stability-under-change a requirement.
"Mysterious" gateway problems live in this layer. conntrack table exhaustion (a busy gateway opens huge numbers of flows), iptables rule bloat at thousands of Services (kube-proxy sync latency), NodePort/ externalTrafficPolicy quirks dropping the client IP, MTU mismatches on overlays causing fragmentation/black-holes. The JD's "root cause using data" maps directly to debugging these.
Netflix moved compute to Kubernetes. Understanding how custom controllers, container runtime customization (NRI/OCI hooks), and the CNI fit together is the context for "Managing Netflix's Compute" and the NRI talk — and for running a gateway fleet on K8s.

3. How does it work?

A packet's journey to a pod

client → Service DNS (kube-dns/CoreDNS) → ClusterIP (virtual)
   → kube-proxy rule (iptables/IPVS/eBPF) rewrites dst to a pod IP (DNAT)
   → routed to the node hosting that pod (overlay/underlay via CNI)
   → node's veth/bridge → pod's veth → pod netns → container :port
   ← conntrack remembers the mapping so replies are un-DNAT'd correctly

The conntrack table is load-bearing: every connection through a ClusterIP creates a conntrack entry; a busy gateway can exhaust nf_conntrack_max and start dropping connections — a classic outage with a one-line sysctl fix once you know to look.

CNI in detail

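When the kubelet starts a pod, the container runtime (containerd, CRI-O) sets up the pod sandbox and invokes the CNI plugin with ADD. Between the runtime and the plugin, the following happens: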

1. create the pod's network namespace
2. create a veth pair: vethXXXX (node side) <-> eth0 (pod side)
3. IPAM: allocate a pod IP from the node's CIDR
4. wire the node side into a bridge / set routes / program eBPF
5. for overlays: encapsulate cross-node traffic (VXLAN/Geneve) or use
   native routing (Calico BGP, AWS VPC CNI = real VPC IPs)
6. return the assigned IP to kubelet

DEL tears it down on pod stop. The CNI spec is deliberately tiny — a binary + JSON config — which is why there are so many implementations.

kube-proxy modes

Mode	How	Trade-off
iptables	one chain per Service/endpoint; random/probability DNAT	simple, ubiquitous; rule count and sync time grow O(Services×endpoints) — slow at large scale
IPVS	kernel L4 LB (hash tables), real LB algos	scales to many Services; needs the IPVS kernel modules
eBPF (Cilium)	replace kube-proxy with eBPF programs at the socket/XDP layer	fastest, best observability, bypasses iptables; newer, more complex
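
The "random/probability DNAT" in the iptables row is worth working through once. With n endpoints, rule i (counting from 0) matches with probability 1/(n-i) and the last rule always matches, which gives every endpoint an equal 1/n share. The helper names below are illustrative, not kube-proxy code.

```go
package main

import "fmt"

// ruleProbabilities returns the per-rule match probability for a chain
// of n endpoint rules: 1/n, 1/(n-1), ..., 1/1.
func ruleProbabilities(n int) []float64 {
	p := make([]float64, n)
	for i := range p {
		p[i] = 1.0 / float64(n-i)
	}
	return p
}

// effectiveShare walks the chain in order and computes the fraction of
// all connections that reach each rule and match it.
func effectiveShare(rules []float64) []float64 {
	share := make([]float64, len(rules))
	remaining := 1.0 // fraction of traffic that fell through earlier rules
	for i, p := range rules {
		share[i] = remaining * p
		remaining -= share[i]
	}
	return share
}

func main() {
	rules := ruleProbabilities(4)
	fmt.Println("per-rule probability:", rules)
	fmt.Println("effective share:     ", effectiveShare(rules))
}
```

The chain is evaluated linearly per new connection, which is one reason the packet-path cost grows with endpoint count in this mode.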

Services, EndpointSlices, and readiness

A failing readiness probe marks the pod's endpoint as not ready in its Service's EndpointSlices (ready: false), so kube-proxy stops sending it new traffic. This is the hook that makes graceful drain work (gw-01/gw-05): flip readiness first, then drain.
EndpointSlices shard endpoints into chunks (default ≤100 each) so a 1000-pod Service doesn't ship one giant object on every change — a scalability fix over the old Endpoints object.
externalTrafficPolicy: Local keeps the client source IP (no extra hop/SNAT) but only routes to pods on the receiving node — a real trade-off for an edge LB (preserve IP vs even spread).

The drain ordering you must get right

pod deletion:
  1. pod marked Terminating; removed from EndpointSlices (async!)
  2. preStop hook runs (e.g. flip readiness, start app drain — gw-01/gw-05)
  3. SIGTERM sent to the container
  4. grace period (terminationGracePeriodSeconds) keeps counting down; its clock started at step 1, so preStop time counts against it
  5. SIGKILL if still alive

The race: EndpointSlice removal (step 1) is eventually consistent across all kube-proxies — some nodes may still send traffic for a moment. So your app must keep serving during preStop/grace, not exit on the first SIGTERM. This is exactly the gw-01 "fail readiness, then drain with a deadline" pattern, and why a high-density WebSocket node (gw-05) needs a long grace period.

4. Core terminology

Term	Definition
Pod IP	Every pod's own routable IP; no NAT between pods.
CNI	Container Network Interface: spec + plugins that wire pod networking.
veth pair	A virtual cable: one end in the pod netns, one on the node.
IPAM	IP address management — allocating pod IPs from a CIDR.
Overlay	Encapsulated pod network across nodes (VXLAN/Geneve); vs native routing.
Service / ClusterIP	Stable virtual IP + DNS fronting a set of pods.
EndpointSlice	Sharded list of ready pod IPs behind a Service (membership source).
kube-proxy	Programs nodes to map ServiceIP → pod IP (iptables/IPVS/eBPF).
conntrack	Kernel connection-tracking table; finite, exhaustible on a busy gateway.
Readiness probe	Health check that gates Service membership; the drain hook.
externalTrafficPolicy	`Cluster` (even spread, SNAT, loses client IP) vs `Local` (keeps IP, node-local pods).
NetworkPolicy	L3/L4 (and with CNI extensions, L7) firewall rules between pods.
eBPF	In-kernel programmable dataplane (Cilium) replacing iptables.
NRI / OCI hooks	Container-runtime extension points to customize networking/storage per workload (Netflix talk).

5. Mental models

A Service is a phone extension; EndpointSlices are the directory; kube-proxy is the switchboard. You dial the extension (ClusterIP); the switchboard looks up who's currently at that extension (EndpointSlices) and connects you to one of them (DNAT). Readiness is someone marking themselves "do not connect new calls."
CNI is the patch panel for pods. Each pod gets a virtual cable (veth) from its private room (netns) to the building's network. The CNI plugin is the technician who runs the cable, assigns the jack (IP), and decides whether floors are connected by real wiring (native routing) or tunnels (overlay).
conntrack is the gateway's hidden quota. Every flow takes a slot; the table is finite. A connection-heavy gateway (especially with churn — gw-04) silently fills it and starts dropping, looking like a random network failure until you check nf_conntrack_count vs _max.
Readiness-then-drain is leaving a party politely. You tell the host you're heading out (fail readiness → removed from rotation), finish your current conversation (drain in-flight), then leave (exit). Leaving mid-sentence (exit on first SIGTERM) drops whoever you were talking to.

6. Common misconceptions

"Service load balancing is real load balancing." kube-proxy does L4, connection-level, roughly-even DNAT — not request-aware, latency-aware, or L7. It pins a whole connection (so h2/gRPC hit the multiplexing trap, gw-02) and doesn't do P2C/outlier ejection (gw-06). Real LB is the gateway's job, not kube-proxy's.
"A pod is removed from rotation the instant it's deleted." EndpointSlice propagation to every kube-proxy is eventually consistent; there's a window where traffic still arrives. Hence preStop + grace + keep-serving-during-drain.
"All CNIs are equivalent." They differ enormously: overlay vs native routing (latency, MTU), IPAM scale, NetworkPolicy support, eBPF vs iptables dataplane, cloud integration (real VPC IPs vs encapsulation). The choice shapes performance and observability.
"iptables kube-proxy is fine at any scale." Rule count and sync latency grow with Services×endpoints; at thousands of Services, programming latency and packet-path cost become real. IPVS or eBPF are the scale answers.
"externalTrafficPolicy doesn't matter." Cluster SNATs and hides the client IP (bad for an edge that needs it — gw-01 PROXY protocol / gw-07 identity); Local preserves it but can imbalance load. It's a deliberate edge decision.

7. Interview talking points

"Trace a packet from a client to a pod." Service DNS → ClusterIP → kube-proxy DNAT (iptables/IPVS/eBPF) to a pod IP → routed via CNI (overlay or native) to the node → veth into the pod netns → container; conntrack tracks the mapping for the return path. Naming conntrack and the kube-proxy modes signals depth.
"How do you drain a gateway pod without dropping requests on Kubernetes?" preStop hook flips readiness (and triggers app drain) → pod removed from EndpointSlices (eventually) → keep serving in-flight during the grace period → exit before SIGKILL. Set terminationGracePeriodSeconds long enough (seconds for L7, minutes for a 200k-connection WebSocket node — gw-05). Acknowledge the EndpointSlice propagation race.
"What's the gotcha with gRPC/HTTP-2 behind a ClusterIP?" kube-proxy is connection-level L4, so one long-lived h2 connection pins all its streams to one pod (gw-02's multiplexing trap). You need L7 LB (a real proxy / mesh / headless Service + client-side LB), not bare ClusterIP.
"A busy gateway is randomly dropping connections — where do you look?" conntrack exhaustion (nf_conntrack_count near _max), ephemeral-port exhaustion, kube-proxy sync latency / rule bloat, readiness flapping. Reduce flows with connection reuse (gw-04); raise limits as a stopgap. Data-driven, exactly as the JD asks.
"iptables vs IPVS vs eBPF for kube-proxy?" iptables: simple, ubiquitous, O(n) rule growth → slow at scale. IPVS: kernel hash-table L4 LB, scales to many Services. eBPF (Cilium): fastest, best observability, bypasses iptables, replaces kube-proxy. Pick by scale and observability needs.
"How does the gateway learn its origins' addresses?" EndpointSlices (watched by the control plane / mesh) → EDS (gw-08) → the gateway's cluster membership → subsetting (gw-04). Pod churn drives the stability requirement.

8. Connections to other labs

gw-01 / gw-05 (drain) — readiness + preStop + grace period are the Kubernetes machinery that makes graceful drain actually work.
gw-04 (connection management) — EndpointSlices are the membership the subsetting ring consumes; pod churn is why stability matters; conntrack limits are the L4 ceiling.
gw-06 (resilience) — kube-proxy is not a substitute for real LB; this lab explains why you still need P2C/outlier ejection.
gw-08 (Envoy/xDS) — the control plane watches the K8s API (Services/EndpointSlices) to compute EDS.
gw-10 (Gateway API/operator) — CRDs + the operator pattern, built on the same API machinery; the Gateway API replaces Service-type LoadBalancer/Ingress for L7.
gw-12 (migration) — running the gateway fleet on K8s, NRI/OCI runtime customization, and the Titus→K8s migration story.

gw-09 — The Hitchhiker's Guide to Kubernetes Networking Internals

Companion to CONCEPTS.md, with runnable simulations in src/go/k8snet/. The gateway fleet runs on Kubernetes (Netflix's Titus→K8s journey); these are the two networking behaviors that actually page you.

You can't spin up a real cluster in a unit test, and reading the Kubernetes docs doesn't build the reflex you need on call. So this lab simulates the two highest-impact behaviors deterministically: the drain / EndpointSlice propagation race and conntrack exhaustion. Run bash scripts/verify.sh:

drain / EndpointSlice race (3 kube-proxies; 10 req/proxy/tick):
  WRONG (exit first):           100 requests dropped
  RIGHT (fail readiness first):   0 requests dropped
  RIGHT but grace too short:    400 requests dropped (SIGKILL mid-propagation)
conntrack exhaustion (table size 100, 1000 requests):
  churn (new connection each):  900 packets dropped
  keep-alive (reuse 1 flow):      0 packets dropped

Those two blocks are among the most common self-inflicted gateway outages on Kubernetes. Internalize them and you can debug "every deploy drops a few requests" and "the node randomly drops connections under load", two incidents that otherwise eat a day each.

1. The drain race (drain.go)

A Service is fronted by many kube-proxies, each with its own local view of the endpoints, updated from EndpointSlices. Removal is eventually consistent: when a pod leaves, each proxy stops routing to it only after it observes the update — and they don't all observe it at the same instant.

SimulateDrain models three kube-proxies with propagation delays [2,3,5] ticks:

WRONG (readinessFirst=false) — the pod exits immediately, but the proxies keep routing to it for their propagation delays. Result: (2+3+5)×10 = 100 requests dropped (TestDrainRaceDropsWithoutReadiness). This is the "every deploy drops a few requests" bug, and it's why a naive sleep 30 && exit doesn't fix anything — sleeping doesn't make the LB stop sending you traffic.
RIGHT (readinessFirst=true) — readiness fails first, so removal starts propagating while the pod keeps serving; it only exits after every proxy has stopped routing to it (within the grace period). Result: 0 dropped (TestDrainNoDropWithReadinessFirst). The ordering is always: fail readiness → drain → exit. This is the Kubernetes form of the gw-01 drain ordering, made precise.
RIGHT but grace too short — TestDrainGraceTooShort adds a proxy that takes 50 ticks to update with only a 10-tick grace: the pod is SIGKILLed mid-propagation and the slow proxy's traffic drops. This is the gw-05 problem: a high-density WebSocket node needs a grace period (terminationGracePeriodSeconds) of minutes; the default 30s would SIGKILL 200k connections mid-migration. Same model, different timeout.
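
The arithmetic behind the three results can be reproduced with a small model. The function below is a hypothetical stand-in, not the lab's SimulateDrain; its signature and the per-proxy accounting are assumptions chosen to match the numbers quoted above.

```go
package main

import "fmt"

// simulateDrain counts requests dropped while endpoint removal propagates.
//   delays:         ticks until each kube-proxy observes the removal
//   reqPerTick:     requests each proxy sends per tick while still routing to the pod
//   readinessFirst: whether the pod keeps serving after failing readiness
//   grace:          ticks before SIGKILL (only relevant when readinessFirst)
func simulateDrain(delays []int, reqPerTick int, readinessFirst bool, grace int) int {
	dropped := 0
	for _, d := range delays {
		served := 0 // ticks during which the pod is still alive to answer this proxy
		if readinessFirst {
			served = min(d, grace)
		}
		dropped += (d - served) * reqPerTick
	}
	return dropped
}

func main() {
	fmt.Println(simulateDrain([]int{2, 3, 5}, 10, false, 0))     // 100
	fmt.Println(simulateDrain([]int{2, 3, 5}, 10, true, 10))     // 0
	fmt.Println(simulateDrain([]int{2, 3, 5, 50}, 10, true, 10)) // 400
}
```

The third case drops (50-10)×10 = 400 requests: only the slow proxy outlives the grace period, and everything it sends after SIGKILL is lost.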

In a real pod, "fail readiness first" is a preStop hook that flips the readiness probe (and triggers your app drain — gw-01/gw-05), with terminationGracePeriodSeconds sized to cover propagation + in-flight-request completion.

EndpointSlice/Shard model the other scalability detail: endpoints are sharded (~100 per slice) so a 1000-pod Service doesn't ship one giant object on every change (TestEndpointSliceSharding). Those slices are the membership the control plane turns into EDS (gw-08) and the gw-04 subset ring consumes.

2. conntrack exhaustion (conntrack.go)

On nodes that do netfilter/NAT (most Kubernetes setups), every connection consumes a slot in the kernel's nf_conntrack table. Conntrack models it: Track(flow) reuses an existing flow for free but consumes a slot for a new flow, dropping when full.

SimulateConntrack(100, 1000, ...):

churn (a new connection per request) → 1000 new flows into a 100-slot table → 900 drops (TestConntrackExhaustionUnderChurn). The symptom is "the node randomly drops connections under load" with a terse "nf_conntrack: table full, dropping packet" line in dmesg, which is easy to misdiagnose as a network problem.
keep-alive (reuse one flow) → 0 drops (TestConntrackOkWithKeepAlive).
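
A table with this behavior fits in a few lines. The sketch below is a hypothetical model, not the lab's Conntrack type, and it ignores entry expiry, which the real kernel table has.

```go
package main

import "fmt"

// conntrack is a fixed-size flow table: known flows are free,
// new flows take a slot, and a full table drops new flows.
type conntrack struct {
	size  int
	flows map[string]struct{}
}

func newConntrack(size int) *conntrack {
	return &conntrack{size: size, flows: make(map[string]struct{})}
}

// track reports whether a packet for the given flow is accepted.
func (c *conntrack) track(flow string) bool {
	if _, ok := c.flows[flow]; ok {
		return true // existing flow: no new slot needed
	}
	if len(c.flows) >= c.size {
		return false // table full: drop
	}
	c.flows[flow] = struct{}{}
	return true
}

// simulate sends requests through a table and returns the drop count.
func simulate(size, requests int, keepAlive bool) int {
	ct := newConntrack(size)
	dropped := 0
	for i := 0; i < requests; i++ {
		flow := "client:40000->gateway:443"
		if !keepAlive {
			// churn: every request arrives from a fresh source port
			flow = fmt.Sprintf("client:%d->gateway:443", 40000+i)
		}
		if !ct.track(flow) {
			dropped++
		}
	}
	return dropped
}

func main() {
	fmt.Println("churn:     ", simulate(100, 1000, false)) // 900
	fmt.Println("keep-alive:", simulate(100, 1000, true))  // 0
}
```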

The punchline ties the phase together: connection reuse (gw-04) is not just a CPU/latency optimization — it's also what keeps you under the conntrack ceiling. The durable fix for conntrack exhaustion is fewer connections (pool + keep-alive + h2 multiplexing); raising nf_conntrack_max is the stopgap.

3. The rest of the model (read CONCEPTS for depth)

The simulations cover the two incident-grade behaviors; CONCEPTS.md covers the full picture you must be able to narrate:

The packet path: Service DNS → ClusterIP → kube-proxy DNAT (iptables / IPVS / eBPF) → pod IP → CNI (veth, overlay vs native routing) → container.
kube-proxy is L4 only — connection-level DNAT, not request/latency aware; this is why h2/gRPC behind a bare ClusterIP hits the multiplexing trap (gw-02) and why you still need real LB (gw-06).
CNI wiring (veth pairs, IPAM, overlay vs native), NetworkPolicy, externalTrafficPolicy (Cluster vs Local — even spread + SNAT vs client-IP preservation), and NRI/OCI hooks for per-workload runtime customization (the Netflix talk).
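
The multiplexing trap in the kube-proxy bullet above is easy to see with numbers. The sketch compares a backend chosen once per connection with a backend chosen per request. Both helpers are illustrative, and they assign by index for determinism where kube-proxy would pick randomly.

```go
package main

import "fmt"

// perConnection pins every request on a connection to the pod chosen
// when that connection was opened (connection-level L4 behavior).
func perConnection(pods, conns, reqsPerConn int) []int {
	load := make([]int, pods)
	for c := 0; c < conns; c++ {
		load[c%pods] += reqsPerConn
	}
	return load
}

// perRequest picks a pod for every request (what an L7 balancer can do).
func perRequest(pods, total int) []int {
	load := make([]int, pods)
	for r := 0; r < total; r++ {
		load[r%pods]++
	}
	return load
}

func main() {
	// One long-lived h2/gRPC connection carrying 900 requests to 3 pods.
	fmt.Println("ClusterIP, 1 connection:", perConnection(3, 1, 900)) // [900 0 0]
	fmt.Println("L7, per request:        ", perRequest(3, 900))       // [300 300 300]
}
```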

docs/analysis.md lists the hands-on cluster exercises (trace a packet with nsenter, watch EndpointSlices during a rollout, reproduce the drain race and conntrack exhaustion on a kind cluster) to do once you have the simulated reflexes.

4. Exercises

Reproduce both on a real kind cluster: run a wrk loop against a Service while deleting a pod with vs without a preStop that fails readiness; then lower nf_conntrack_max and blast short connections. Confirm the simulated numbers match reality qualitatively.
Trace a packet: nsenter into a pod's netns, find its veth peer, dump the iptables/ipvsadm rules for the ClusterIP, follow the DNAT to a pod IP.
Size a grace period: given EndpointSlice propagation P and in-flight request duration D, derive terminationGracePeriodSeconds for an L7 node and for a 200k-connection WebSocket node (gw-05).
Model externalTrafficPolicy: extend the drain sim to compare Cluster (even spread, SNAT, loses client IP) vs Local (client-IP preserved, node-local only) and the load-imbalance trade-off.
Wire to gw-08: turn the sharded EndpointSlices into EDS resources and feed the gw-04 subset ring; debounce re-subsetting under pod flapping so the fix doesn't cause churn.

gw-09 — References

The Netflix angle (named in the JD)

Managing Netflix's Compute with Kubernetes and Dynamic… — the Titus→Kubernetes story; custom scheduling/controllers; running fleets on K8s. (Search the Netflix TechBlog / KubeCon talks.)
Container Runtime Customization at Netflix (NRI & OCI Hooks): containerd NRI + OCI hooks to customize per-pod networking/storage/sidecars while staying K8s-compatible. https://github.com/containerd/nri
Titus: Introducing Containers to the Netflix Cloud — ACM Queue. https://queue.acm.org/detail.cfm?id=3158370

Specs & docs

CNI spec — the plugin contract (ADD/DEL/CHECK, IPAM). https://github.com/containernetworking/cni/blob/main/SPEC.md
Kubernetes networking docs — the cluster network model, Services, EndpointSlices, DNS, NetworkPolicy. https://kubernetes.io/docs/concepts/services-networking/
kube-proxy modes (iptables / IPVS) and the dataplane comparison.
OCI runtime spec — hooks (createRuntime, prestart, etc.). https://github.com/opencontainers/runtime-spec

CNI implementations to study

Cilium (eBPF dataplane, replaces kube-proxy, L3–L7 NetworkPolicy, Hubble observability). https://github.com/cilium/cilium
Calico (native routing / BGP, policy). AWS VPC CNI (real VPC IPs, no overlay). flannel (simple VXLAN overlay).

Debugging & background

Networking and Kubernetes (O'Reilly) — the end-to-end packet path.
Brendan Gregg / eBPF resources — bpftrace, tcpdump, tracing the kernel network path.
Articles on conntrack exhaustion and kube-proxy iptables scaling (common production incidents).

Tooling

kubectl get endpointslices, kubectl get svc -o wide — see membership.
nsenter / ip netns — enter a pod's netns; ip a, ip route.
iptables-save | grep <svc> / ipvsadm -Ln — read kube-proxy rules.
conntrack -L / sysctl net.netfilter.nf_conntrack_{count,max}.
cilium monitor, hubble observe — eBPF dataplane visibility.

Cross-lab dependencies

Upstream: gw-01/gw-05 (drain), gw-04 (membership/conntrack), gw-08 (control plane watches the K8s API).
Downstream: gw-10 (CRDs/operators on the same API machinery), gw-12 (running the fleet on K8s, NRI/OCI).

gw-09 — Analysis

This lab is operational rather than build-from-scratch: you can't re-implement the kernel network stack in a step, so the work is reproducing and reading the real machinery on a local cluster (kind / minikube / k3d) and connecting it to the drain, membership, and resilience concerns of the rest of the phase.

What to actually do (in lieu of `steps/` code)

Trace a packet. On a kind cluster, deploy a Service + 3 pods. nsenter into a pod's netns, find its veth peer on the node, dump the iptables/ipvs rules for the ClusterIP, and follow the DNAT to a pod IP. Write the path down hop by hop.
Watch EndpointSlices during a rollout. kubectl get endpointslices -w while you kubectl rollout restart a Deployment. Observe pods leaving/joining and the propagation delay. This is the gw-04 membership churn and the gw-08 EDS source.
Prove the drain race. Run a wrk loop against a Service while you delete a pod with (a) no preStop/short grace and (b) a preStop that fails readiness + sleeps. Show (a) drops requests and (b) does not. This is gw-01/gw-05 drain on real Kubernetes.
Exhaust conntrack (carefully). On a test node, lower nf_conntrack_max, blast many short connections, watch nf_conntrack_count hit the cap and connections drop. Then enable keep-alive (gw-04) and watch the flow count fall.
See the gRPC/h2 ClusterIP trap. Send many gRPC calls over one channel to a ClusterIP-fronted Service; observe the skew (gw-02); switch to a headless Service + client-side LB and watch it even out.

Required understanding (the design-review bar)

Drain ordering on K8s: preStop (fail readiness, start app drain) → EndpointSlice removal (eventually consistent) → SIGTERM → grace → SIGKILL. Keep serving until removal has propagated and in-flight work has drained, then exit before the grace period expires; never exit on first SIGTERM. Grace = seconds for L7, minutes for high-density WebSocket nodes (gw-05).
Membership source: EndpointSlices → (control plane) → EDS (gw-08) → subset ring (gw-04). Pod churn drives stability-under-change.
kube-proxy is L4 only: connection-level DNAT, not request/latency aware; explains why real LB (gw-06) and L7 (gw-02) still live in the gateway, and why h2/gRPC need more than a ClusterIP.
Hidden ceilings: conntrack table, ephemeral ports, iptables rule bloat / sync latency, MTU on overlays. These are the "mysterious latency/drops" root causes the JD's data-driven debugging targets.
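
For the conntrack ceiling, the data-driven check is a ratio of two sysctls. A small sketch that reads them from /proc on a Linux node with the conntrack module loaded; the `utilization` helper and the 80% warning threshold are assumptions, not values from the lab.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// utilization parses the raw sysctl contents and returns count/max.
func utilization(countRaw, maxRaw string) (float64, error) {
	count, err := strconv.Atoi(strings.TrimSpace(countRaw))
	if err != nil {
		return 0, err
	}
	limit, err := strconv.Atoi(strings.TrimSpace(maxRaw))
	if err != nil {
		return 0, err
	}
	if limit <= 0 {
		return 0, fmt.Errorf("invalid nf_conntrack_max %d", limit)
	}
	return float64(count) / float64(limit), nil
}

func main() {
	const dir = "/proc/sys/net/netfilter/"
	count, err1 := os.ReadFile(dir + "nf_conntrack_count")
	limit, err2 := os.ReadFile(dir + "nf_conntrack_max")
	if err1 != nil || err2 != nil {
		fmt.Println("conntrack sysctls not readable here (not Linux, or module not loaded)")
		return
	}
	u, err := utilization(string(count), string(limit))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("conntrack utilization: %.1f%%\n", u*100)
	if u > 0.8 { // hypothetical alert threshold
		fmt.Println("warning: close to nf_conntrack_max")
	}
}
```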

Tradeoffs worth flagging

Overlay vs native routing. Overlays (VXLAN) work anywhere but add encapsulation overhead and MTU pitfalls; native routing / real VPC IPs (AWS VPC CNI, Calico BGP) are faster and simpler to debug but cloud/topology-specific.
iptables vs IPVS vs eBPF. Simplicity/ubiquity vs scale vs performance+observability. The right answer depends on Service count and the value of L7/eBPF visibility.
externalTrafficPolicy Cluster vs Local. Even spread + SNAT (loses client IP) vs client-IP preservation + node-local routing (possible imbalance). For an edge LB that needs the client IP/identity (gw-01, gw-07), Local (or PROXY protocol upstream) is often required.
NRI/OCI runtime customization vs portability. Customizing the runtime per workload (Netflix's NRI talk) unlocks specialized networking/sidecar behavior but adds a node-level component to operate; the win is doing it without forking Kubernetes.

What production adds beyond a local cluster

A production CNI at scale (IPAM exhaustion, BGP/route-table limits, NetworkPolicy enforcement, multi-cluster).
eBPF observability (Hubble) to actually see the dataplane.
Node-level runtime customization (NRI/OCI hooks) for the gateway/origin workloads.
PodDisruptionBudgets + long grace periods + surge control so fleet rollouts don't drain too many nodes at once (gw-12).
Tuned kernel limits (conntrack, ephemeral ports, somaxconn) baked into the node image for connection-heavy gateway workloads.

gw-09 — Execution

Prerequisites

Go ≥ 1.25 (stdlib only, offline). Optional for the real-cluster exercises: kind/k3d, kubectl, nsenter, conntrack.

One-shot

cd gw-09-kubernetes-networking && bash scripts/verify.sh   # tests + sims

Per-language workflow (Go)

cd gw-09-kubernetes-networking/src/go
go test -race -count=1 ./...        # drain race, conntrack, sharding
go run ./cmd/netsim                 # drain ordering + conntrack demo

Package map

File	What
`k8snet/drain.go`	drain / EndpointSlice propagation-race model; readiness-first ordering; grace sizing; slice sharding
`k8snet/conntrack.go`	conntrack table model + churn-vs-keepalive exhaustion sim
`cmd/netsim`	runs both simulations and prints the takeaways

Real-cluster exercises

See docs/analysis.md and GUIDE.md §4 for the hands-on cluster labs: trace a packet with nsenter, watch EndpointSlices during a kubectl rollout restart, reproduce the drain race and conntrack exhaustion on kind.

gw-09 — Verification

One command

cd gw-09-kubernetes-networking && bash scripts/verify.sh

What the tests prove

Test	Invariant
`TestDrainRaceDropsWithoutReadiness`	exiting before readiness fails drops `Σ(delay)×rps` requests during EndpointSlice propagation
`TestDrainNoDropWithReadinessFirst`	fail-readiness-first + adequate grace → zero drops
`TestDrainGraceTooShort`	a too-short grace SIGKILLs mid-propagation and drops (the gw-05 high-density-node hazard)
`TestConntrackExhaustionUnderChurn`	1000 new flows into a 100-slot table → 900 drops
`TestConntrackOkWithKeepAlive`	reusing one flow → zero conntrack drops
`TestConntrackReuseSlot`	a reused flow takes one slot; a third distinct flow at Max=2 is dropped
`TestEndpointSliceSharding`	endpoints shard into ≤N-per-slice (scalability over monolithic Endpoints)

All under -race.

What "green" does NOT guarantee

Simulations, not a real cluster. They model the mechanics deterministically; the real-cluster exercises (GUIDE §4, analysis.md) validate against an actual kind/k3d setup.
No live packet path. Tracing veth/iptables/IPVS/eBPF and the CNI wiring is hands-on cluster work (CONCEPTS §3, analysis.md).
No NRI/OCI runtime customization here — that's the gw-12 migration enabler.

gw-10 — Gateway API, CRDs, and the Operator Pattern

The JD asks for "Kubernetes internals (… CRDs, and Operator patterns)" and names "Kubernetes Gateway API, Istio Gateway." This lab closes the loop: the Gateway API is the standard, role-oriented, extensible Kubernetes way to express L7 routing (the successor to Ingress), and the operator pattern is how you implement a gateway's control plane as a Kubernetes controller — watch custom resources, reconcile them into data-plane config (the xDS of gw-08, or vendor config), and report status back.

This is where gw-08 (control plane) and gw-09 (K8s internals) combine. An operator is a control plane that speaks Kubernetes. You will build a controller-runtime operator that watches Gateway/HTTPRoute resources and reconciles them into proxy configuration — the exact shape of Envoy Gateway, Contour, and Istio.

1. What is it?

A Custom Resource Definition (CRD) extends the Kubernetes API with your own object types. Once registered, a CRD's resources are first-class: kubectl get, RBAC, watches, status — all work. A controller/operator is a process that watches those resources and drives the real world to match them, via the reconcile loop:

        ┌─────────── reconcile loop (level-triggered) ───────────┐
 watch ▶│  observe desired state (the CRD spec)                   │
        │  observe actual state (the data plane / external system)│
        │  compute the diff                                       │
        │  take actions to converge actual → desired             │
        │  write status back to the CRD                           │
        └──────────────────────▲─────────────────────────────────┘
                               │ re-trigger on any change or resync

The key property is level-triggered, declarative reconciliation: the controller doesn't react to events ("a route was added"); it repeatedly drives toward the desired state ("these routes should exist"), so it's self-healing and idempotent — it converges no matter how many events it missed or how it was restarted.

The Gateway API is a set of standard CRDs for L7 routing, split by role:

Resource	Owner role	Meaning
GatewayClass	infra provider	"this kind of gateway is implemented by controller X" (like a StorageClass)
Gateway	cluster operator	"listen on these ports/protocols/hostnames with these certs"
HTTPRoute	app developer	"match these requests → send to these Services" (the gw-03 route table)
TCPRoute/GRPCRoute/TLSRoute	app developer	L4/gRPC/TLS variants
ReferenceGrant	namespace owner	cross-namespace permission to route to a Service

It replaces Ingress (too limited, and dependent on vendor-specific annotations) with a typed, role-separated, extensible model that is portable across implementations (Envoy Gateway, Istio, Contour, NGINX, cloud LBs).

2. Why does it matter?

It's the named technology and the future of K8s ingress. "Gateway API, Istio Gateway" is in the JD. Ingress is legacy; the Gateway API is where the industry (and a migration-minded team) is heading — and migrations are the team's bread and butter (gw-12).
The operator pattern is how gateway control planes are built on K8s. Envoy Gateway, Contour, and Istio are operators: watch Gateway API CRDs → reconcile into Envoy xDS. Building one teaches you exactly how the gateway you'd operate is wired, and the JD's "Operator patterns" requirement is literally this.
Reconciliation is the safest way to manage fleet config. Because it's level-triggered and idempotent, an operator self-heals: drift, partial failures, and restarts all converge. This is the same desired-state discipline as gw-08's reconcile loop, formalized by Kubernetes.
Status reporting closes the operability loop. A good operator writes back status (Accepted/Programmed/conditions) so users and dashboards know whether their HTTPRoute actually took effect — the CRD analog of xDS ACK/NACK (gw-08) and a key observability surface (gw-11).

3. How does it work?

The reconcile loop (controller-runtime)

Reconcile(ctx, req):                       # req = namespaced name of a changed object
    obj = get(req)                         # desired state, read via the API server / cache
    if not found: return                   # deleted; finalizers handle cleanup
    actual = observe_external_state(obj)   # what the data plane currently has
    if actual != desired(obj):
        program_data_plane(desired(obj))   # converge (e.g. push xDS — gw-08)
    set_status(obj, Programmed=true)       # report back
    return (requeue if needed)

It runs whenever a watched object changes and on a periodic resync, so a missed event never leaves you wrong forever. It must be idempotent (reconciling the same state twice = no-op) and fast (work is queued; slow reconciles back up the queue).
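
The pseudocode above can be reduced to a stdlib-only Go sketch that shows the idempotence property directly. It does not use controller-runtime: the `dataPlane` map and the `reconcile` signature are assumptions standing in for a real proxy and a real Reconciler.

```go
package main

import "fmt"

// dataPlane is a hypothetical stand-in for the routes a proxy has programmed.
type dataPlane map[string]string // path prefix -> backend Service

// reconcile drives actual toward desired and reports how many changes it made.
// Running it again on converged state makes zero changes.
func reconcile(desired map[string]string, actual dataPlane) (changes int) {
	for prefix, backend := range desired {
		if got, ok := actual[prefix]; !ok || got != backend {
			actual[prefix] = backend // add or correct a route
			changes++
		}
	}
	for prefix := range actual {
		if _, ok := desired[prefix]; !ok {
			delete(actual, prefix) // remove a route nobody asked for
			changes++
		}
	}
	return changes
}

func main() {
	desired := map[string]string{"/v1/play": "playback", "/v1/search": "search"}
	actual := dataPlane{"/old": "legacy"}

	fmt.Println("first pass: ", reconcile(desired, actual)) // 3
	fmt.Println("second pass:", reconcile(desired, actual)) // 0

	delete(actual, "/v1/play") // simulate drift in the data plane
	fmt.Println("after drift:", reconcile(desired, actual)) // 1
}
```

The second pass returning 0 is the idempotence check; the pass after drift is the self-healing behavior a periodic resync buys you.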

How a request flows from CRD to data plane

app dev:  kubectl apply HTTPRoute (prefix /v1/play -> Service playback)
   │ watch
operator: reconcile -> compute routes/clusters/endpoints
   │ (endpoints from EndpointSlices — gw-09)
   ▼
data plane config: Envoy xDS snapshot (gw-08)  OR  reload a proxy
   ▼
Envoy/proxy serves /v1/play -> playback pods
   │
operator: write HTTPRoute.status.conditions = [Accepted, Programmed]

The operator is the bridge from the declarative K8s API (gw-09) to the imperative data-plane config (gw-08). That's the whole job.
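
The "compute routes" step in that flow can be sketched as compiling route objects into a longest-prefix table. The `httpRoute` and `rule` types are hypothetical, heavily trimmed stand-ins for the Gateway API resources; matching is done per path element, as the Gateway API's PathPrefix match type specifies.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// rule maps one path prefix to one backend Service (hypothetical shape).
type rule struct {
	PathPrefix string
	Service    string
}

// httpRoute is a trimmed stand-in for an HTTPRoute object.
type httpRoute struct {
	Name  string
	Rules []rule
}

// routeTable is the compiled data-plane view, longest prefix first.
type routeTable []rule

// compile flattens all routes and orders them so the most specific prefix wins.
func compile(routes []httpRoute) routeTable {
	var t routeTable
	for _, r := range routes {
		t = append(t, r.Rules...)
	}
	sort.SliceStable(t, func(i, j int) bool {
		return len(t[i].PathPrefix) > len(t[j].PathPrefix)
	})
	return t
}

// match returns the Service for a request path, matching whole path elements:
// "/v1/play" matches "/v1/play" and "/v1/play/1" but not "/v1/playback".
func (t routeTable) match(path string) (string, bool) {
	for _, r := range t {
		p := strings.TrimSuffix(r.PathPrefix, "/")
		if path == p || strings.HasPrefix(path, p+"/") {
			return r.Service, true
		}
	}
	return "", false
}

func main() {
	table := compile([]httpRoute{
		{Name: "api", Rules: []rule{{"/v1", "api"}}},
		{Name: "play", Rules: []rule{{"/v1/play", "playback"}}},
	})
	for _, p := range []string{"/v1/play/123", "/v1/playback", "/v2"} {
		svc, ok := table.match(p)
		fmt.Println(p, "->", svc, ok)
	}
}
```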

CRD machinery you must know

OpenAPI schema + validation — the CRD declares its spec's shape; the API server validates on write (plus webhooks for richer rules).
Versioning + conversion webhooks — v1alpha2→v1 etc., with a conversion webhook so old and new clients coexist (vital for migrations).
Finalizers — block deletion until the controller cleans up external state (e.g. deprogram the data plane before the Gateway object disappears), preventing orphaned config.
Owner references — child objects (a Deployment the operator creates for a Gateway) get garbage-collected with the parent.
status subresource — separates user-written spec from controller-written status; conditions communicate Accepted / Programmed / errors.
Leader election — only one controller replica reconciles at a time (HA without double-writes).
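
Of these, the finalizer flow is the one most often gotten wrong, so a sketch may help. The `gateway` struct, the `Deleting` flag (standing in for a set deletionTimestamp), and the finalizer name are all hypothetical; a real controller would update the object through the API server.

```go
package main

import (
	"fmt"
	"slices"
)

const finalizer = "gateway.example.com/deprogram" // hypothetical finalizer name

// gateway is a trimmed stand-in for a custom resource.
type gateway struct {
	Name       string
	Finalizers []string
	Deleting   bool // stands in for a non-nil metadata.deletionTimestamp
}

// reconcile returns true once the object is free to be removed.
func reconcile(gw *gateway, dataPlane map[string]bool) bool {
	if !gw.Deleting {
		// Register the finalizer before creating external state,
		// so a later delete cannot skip cleanup.
		if !slices.Contains(gw.Finalizers, finalizer) {
			gw.Finalizers = append(gw.Finalizers, finalizer)
		}
		dataPlane[gw.Name] = true
		return false
	}
	if slices.Contains(gw.Finalizers, finalizer) {
		delete(dataPlane, gw.Name) // deprogram the data plane first
		gw.Finalizers = slices.DeleteFunc(gw.Finalizers, func(f string) bool {
			return f == finalizer
		})
	}
	return len(gw.Finalizers) == 0
}

func main() {
	dp := map[string]bool{}
	gw := &gateway{Name: "edge"}
	fmt.Println(reconcile(gw, dp), dp) // programmed, finalizer attached
	gw.Deleting = true
	fmt.Println(reconcile(gw, dp), dp) // cleaned up, deletion may proceed
}
```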

Why level-triggered beats edge-triggered

Edge-triggered ("do X when event Y fires") loses correctness if you miss an event (controller was down, queue dropped it). Level-triggered ("make the world match the spec, repeatedly") is self-correcting: it re-derives the right state from scratch each time. This is the single most important operator concept and a frequent interview probe.
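
A missed event makes the difference concrete. In this sketch a controller is down while a delete event fires: replaying only the events it saw leaves a stale route, while a resync against the spec removes it. The `event` type and both functions are illustrative assumptions.

```go
package main

import "fmt"

// event is a hypothetical change notification.
type event struct{ op, prefix, backend string }

// applyEvents is the edge-triggered style: mutate state per event seen.
func applyEvents(state map[string]string, events []event) {
	for _, e := range events {
		switch e.op {
		case "add":
			state[e.prefix] = e.backend
		case "delete":
			delete(state, e.prefix)
		}
	}
}

// resync is the level-triggered style: make state equal the spec,
// regardless of which events were or were not observed.
func resync(state, desired map[string]string) {
	for k := range state {
		if _, ok := desired[k]; !ok {
			delete(state, k)
		}
	}
	for k, v := range desired {
		state[k] = v
	}
}

func main() {
	all := []event{{"add", "/a", "svc-a"}, {"add", "/b", "svc-b"}, {"delete", "/a", ""}}
	desired := map[string]string{"/b": "svc-b"} // the spec after all three events

	edge := map[string]string{}
	applyEvents(edge, all[:2]) // the controller was down for the delete
	fmt.Println("edge-triggered: ", edge)

	level := map[string]string{}
	applyEvents(level, all[:2])
	resync(level, desired) // the next reconcile repairs the gap
	fmt.Println("level-triggered:", level)
}
```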

4. Core terminology

Term	Definition
CRD	Custom Resource Definition: extends the K8s API with a new typed object.
CR	A custom resource (an instance of a CRD).
Controller / Operator	A process that watches resources and reconciles real state to match.
Reconcile loop	The level-triggered converge-to-desired-state function.
Level- vs edge-triggered	React to state (self-healing) vs react to events (lossy).
Idempotent	Reconciling the same state repeatedly causes no extra change.
Gateway API	Standard CRDs for L7 routing: GatewayClass/Gateway/HTTPRoute/…
GatewayClass / Gateway / HTTPRoute	Provider / operator / app-dev layers of the routing model.
Finalizer	A marker that blocks deletion until cleanup runs.
Owner reference	Links child objects for cascading GC.
status / conditions	Controller-written state reporting outcome (Accepted/Programmed).
controller-runtime / kubebuilder	The Go libraries/scaffolding for building operators.
Leader election	Ensures a single active controller replica.

5. Mental models

An operator is a thermostat, not a light switch. A switch is edge-triggered (you flip it, once). A thermostat is level-triggered: it continuously compares actual temperature to the set point and acts to close the gap, forever. Miss a moment and it just corrects on the next read. That's why operators self-heal.
The Gateway API is org-chart-as-API. Ingress mashed everyone's concerns into one annotated object. Gateway API splits them by role: the infra team owns GatewayClass, the platform team owns Gateway (ports/certs), app teams own HTTPRoutes — each with its own RBAC. The schema encodes the org boundaries, which is why large orgs adopt it.
CRD + controller = "teach Kubernetes a new noun and a new verb." The CRD adds the noun (Gateway); the controller adds the behavior (what it means for a Gateway to exist). Kubernetes becomes a general-purpose reconciliation engine for your domain.
Status is the receipt. A user applies an HTTPRoute and walks away; status.conditions is how they learn it was Accepted and Programmed (or why not). Without it, "I applied it but is it live?" is unanswerable — the CRD version of an unmonitored xDS NACK (gw-08).

6. Common misconceptions

"Operators react to events." They reconcile to desired state. Events just trigger a reconcile; the reconcile re-derives everything from current state, so missing an event is survivable. Building an edge-triggered "on add do X, on delete do Y" controller is the classic beginner mistake that breaks on restart.
"Gateway API is just Ingress v2." It's a role-separated, extensible, portable, typed model with first-class L4/L7/TLS/gRPC routes, cross-namespace ReferenceGrant, and status conditions. Ingress couldn't express most of this without vendor annotations.
"CRDs make Kubernetes a database for my config." They're desired state, not just storage — the value is the controller continuously enforcing them. A CRD with no controller is inert YAML.
"Reconcile can be slow because it's called rarely." It's called on every watched change and on resync; under churn it's called a lot. It must be idempotent and quick; slow reconciles back up the work queue and delay convergence across all objects.
"One controller replica is a SPOF." Run multiple with leader election: one reconciles, the others stand by, failover is automatic. You get HA without two controllers fighting over the same resources.

7. Interview talking points

"Explain the operator/reconcile pattern." Level-triggered, declarative, idempotent reconciliation: observe desired (CR spec) + actual (external) state, converge, write status; re-run on change and resync. Contrast level- vs edge-triggered and explain why level-triggered self-heals. This is the answer that proves you understand Kubernetes, not just use it.
"What is the Gateway API and why does it exist?" Standard, role-oriented, extensible L7 routing CRDs (GatewayClass/Gateway/HTTPRoute) replacing Ingress; portable across implementations (Envoy Gateway, Istio, Contour); typed L4/L7/gRPC/TLS routes with status conditions. The role separation maps to org structure and RBAC.
"How would you build a gateway control plane on Kubernetes?" An operator: watch Gateway API CRDs + EndpointSlices (gw-09) → reconcile into Envoy xDS snapshots (gw-08) → write status back. It's gw-08's reconcile loop, but the source of truth is the K8s API and the framework is controller-runtime (work queue, caching informers, leader election).
"Why finalizers and owner references?" Finalizers block deletion until you deprogram the data plane (no orphaned config / no traffic to a deleted backend); owner references cascade-delete the child objects an operator creates. Both prevent the "deleted the CR but the side effects linger" class of bugs.
"How do you safely evolve a CRD?" Versioned API (v1alpha2→v1) with a conversion webhook so old/new clients coexist; additive, optional fields; a deprecation window. This is a migration (gw-12) at the API level — exactly the "leading large-scale migrations" the JD prizes.
"How do users know their route is live?" status.conditions (Accepted, Programmed) written by the controller — the CRD analog of xDS ACK/NACK. Surface it in dashboards/alerts (gw-11). An operator that doesn't report status is unobservable.

8. Connections to other labs

gw-08 (Envoy/xDS) — the operator reconciles CRDs into the xDS snapshots gw-08 serves; together they are a complete K8s-native gateway control plane.
gw-09 (K8s networking) — built on the same API machinery; EndpointSlices feed the operator's endpoint computation; finalizers interact with pod/Service lifecycle.
gw-03 (API gateway) — an HTTPRoute is the declarative form of gw-03's route table; the operator programs the data plane that runs the filter chain.
gw-07 (security) — Gateway listeners reference TLS secrets; the operator wires certs (and can drive SDS).
gw-12 (migration) — migrating Ingress → Gateway API, and rolling out CRD changes safely, are textbook large-scale migrations; CRD versioning/conversion is migration tooling.
db-16…20 (consensus) — Kubernetes is etcd-backed (Raft, db-17); the reconcile loop's "converge to a replicated desired state" is the consensus mindset at the application layer.

gw-10 — The Hitchhiker's Guide to CRDs, Operators & the Gateway API

Companion to CONCEPTS.md, with the runnable operator in src/go/operator/. It closes the loop with gw-08: an operator is a control plane whose source of truth is the Kubernetes API.

The JD wants "Operator patterns" and the "Kubernetes Gateway API." This lab builds the operator pattern in its smallest correct form — an in-memory apiserver, a level-triggered reconcile loop, finalizers, and status — so you understand exactly what controller-runtime does for you. Run bash scripts/verify.sh:

apply (service missing):       Accepted=false(ServiceNotFound) Programmed=false dpPushes=0
service created (self-heal):   Accepted=true(Valid) Programmed=true(Synced) dpPushes=1
reconcile x2 (idempotent):     Accepted=true Programmed=true dpPushes=1
delete issued (Terminating):   exists=true ...
reconcile (finalizer cleanup): exists=false dpPushes=2 dpRoutes=0

That lifecycle — self-heal, idempotent, finalizer cleanup — is the whole pattern.

1. Level-triggered reconciliation (reconcile.go)

The single most important operator concept: a controller reconciles to desired state, it doesn't react to events. Reconcile(name) reads the current state every call and converges — so missing an event, or restarting the controller, never leaves the world permanently wrong.

TestLevelTriggeredSelfHealing proves it: apply an EdgeRoute whose target Service doesn't exist → status Accepted=false/ServiceNotFound. Then create the Service and reconcile again without touching the CR → status flips to Accepted=true/Programmed=true on its own. The controller re-derived the truth from current state. An edge-triggered "on add do X" controller couldn't do this — it would be stuck on the stale "service missing" it saw at creation. This is the thermostat (not light-switch) mental model made concrete.

Idempotency

TestIdempotentReconcile reconciles identical state five times and asserts the data plane was pushed once (dpPushes == 1). The MemDataPlane versions config by content hash, so reprogramming the same route set is a no-op — exactly the gw-08 content-hash debounce. Reconcile runs on every watched change and on resync, so under churn it's called constantly; it must be idempotent and fast or it backs up the work queue.
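A content-hash guard of the kind described above might look like the following. This is a sketch under assumptions: `Route`, `DataPlane`, and `versionOf` are hypothetical names and the lab's MemDataPlane may be structured differently. The idea it illustrates is that the version is a hash of the sorted route set, so identical input is a no-op.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// Route is a hypothetical, simplified route record.
type Route struct{ Name, Host, Prefix, Backend string }

// DataPlane remembers the version of the last config it accepted.
type DataPlane struct {
	version string
	Pushes  int
}

// versionOf hashes the routes in a deterministic order, so the same set
// always yields the same version regardless of input order.
func versionOf(routes []Route) string {
	lines := make([]string, 0, len(routes))
	for _, r := range routes {
		lines = append(lines, strings.Join([]string{r.Name, r.Host, r.Prefix, r.Backend}, "\x00"))
	}
	sort.Strings(lines)
	sum := sha256.Sum256([]byte(strings.Join(lines, "\n")))
	return hex.EncodeToString(sum[:])
}

// Program pushes only when the content hash changed; otherwise it is a no-op.
func (d *DataPlane) Program(routes []Route) (pushed bool) {
	v := versionOf(routes)
	if v == d.version {
		return false
	}
	d.version = v
	d.Pushes++
	return true
}

func main() {
	dp := &DataPlane{}
	a := Route{"r1", "a.example", "/", "svc-a:80"}
	b := Route{"r2", "b.example", "/", "svc-b:80"}

	for i := 0; i < 5; i++ {
		dp.Program([]Route{a, b})
	}
	dp.Program([]Route{b, a}) // same set, different order: still a no-op
	fmt.Println("pushes after identical reconciles:", dp.Pushes)

	dp.Program([]Route{a}) // a real change
	fmt.Println("pushes after a real change:", dp.Pushes)
}
```

Sorting before hashing matters: without it, map or list ordering alone would change the version and cause spurious pushes.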

Rebuild the whole set, don't mutate incrementally

reprogram() rebuilds the data-plane config from all live EdgeRoutes every time, rather than adding/removing one. That's what makes reconcile idempotent and correct after missed events — and it's why TestDeleteReprogramsRemaining shows that deleting r1 leaves the data plane with exactly r2.

2. Status as the contract (reconcile.go)

A user applies a CR and walks away; status.conditions is how they (and dashboards) learn whether it took effect. The crucial ordering: Programmed=true is set only AFTER the data-plane push succeeds. TestStatusProgrammedOnlyAfterSuccess forces the data plane to fail (FailNext) and asserts status becomes Programmed=false/PushFailed — the CR analog of an xDS NACK (gw-08). A controller that reports success before the side effect actually happened is lying to its users; status must reflect reality. Surface Programmed=false / generation-vs-observed skew as a metric and alert on it (gw-11).

3. Finalizers: clean teardown (store.go, reconcile.go)

When you kubectl delete a CR, Kubernetes sets a deletionTimestamp but keeps the object while any finalizers remain — giving the controller a chance to clean up external state first. TestFinalizerCleanup walks it: the live object carries a finalizer; Delete puts it in Terminating (still Exists); the next Reconcile deprograms the data plane (the route is already excluded from ListLive because it's being deleted), then removes the finalizer, which lets the object actually disappear.

Why deprogram before the object is gone? Otherwise there's a window where the CR is deleted but traffic still flows to a dead backend. Finalizers close that window. The maintainer hazard (CONCEPTS §6): a finalizer whose cleanup can never complete (data plane permanently unreachable) blocks deletion forever — so bound the cleanup and keep a break-glass to remove a stuck finalizer.
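The deletion handshake can be sketched with a tiny in-memory store. This is an illustration under assumptions, not the lab's store.go: `Object`, `Store`, and the finalizer name are hypothetical, and `Deleting` stands in for a non-nil deletionTimestamp.

```go
package main

import "fmt"

// Object is a hypothetical stand-in for a custom resource's metadata.
type Object struct {
	Name       string
	Finalizers []string
	Deleting   bool // stands in for a non-nil deletionTimestamp
}

// Store mimics how an apiserver treats deletion when finalizers are present.
type Store struct{ objs map[string]*Object }

// Delete marks the object as Terminating if it has finalizers; otherwise it
// removes it immediately.
func (s *Store) Delete(name string) {
	o, ok := s.objs[name]
	if !ok {
		return
	}
	if len(o.Finalizers) == 0 {
		delete(s.objs, name)
		return
	}
	o.Deleting = true
}

// RemoveFinalizer drops f; a Terminating object with no finalizers left is
// then removed for real.
func (s *Store) RemoveFinalizer(name, f string) {
	o, ok := s.objs[name]
	if !ok {
		return
	}
	kept := o.Finalizers[:0]
	for _, x := range o.Finalizers {
		if x != f {
			kept = append(kept, x)
		}
	}
	o.Finalizers = kept
	if o.Deleting && len(kept) == 0 {
		delete(s.objs, name)
	}
}

// Exists reports whether the object is still stored (live or Terminating).
func (s *Store) Exists(name string) bool { _, ok := s.objs[name]; return ok }

func main() {
	const fin = "example.test/deprogram" // hypothetical finalizer name
	s := &Store{objs: map[string]*Object{"r1": {Name: "r1", Finalizers: []string{fin}}}}

	s.Delete("r1")
	fmt.Println("after delete, exists:", s.Exists("r1")) // still there, Terminating

	// The controller deprograms the data plane here, then releases the object.
	s.RemoveFinalizer("r1", fin)
	fmt.Println("after cleanup, exists:", s.Exists("r1"))
}
```

The break-glass described above corresponds to calling `RemoveFinalizer` by hand when the cleanup step can never succeed.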

4. The controller / work queue (controller.go)

Controller drains a work queue, calling Reconcile per key (TestControllerProcessesQueue). A real controller-runtime controller adds caching informers (watch once, serve many cached reads), a rate-limited requeue (retry with backoff), and leader election (one active replica, HA without double-writes). The level-triggered core — enqueue on change, reconcile to desired state — is exactly what you built. When you scaffold with kubebuilder, you're filling in this Reconcile method; everything else is the framework.

5. The Gateway API and how this becomes Envoy Gateway

The CONCEPTS file covers the Gateway API: the standard, role-separated routing CRDs (GatewayClass/Gateway/HTTPRoute) that replace Ingress. An HTTPRoute is the declarative form of gw-03's route table; this operator's EdgeRoute is a stand-in for it. Swap the MemDataPlane for gw-08's xDS SnapshotCache and point the Source at the Kubernetes API (Gateway-API CRDs + EndpointSlices, gw-09), and you have Envoy Gateway / Contour in miniature: watch CRDs → reconcile into xDS → push to Envoy → write status back. gw-08 + gw-10 together are a complete, K8s-native gateway control plane.

6. Hands-on

cd src/go
bash ../scripts/verify.sh        # tests + the lifecycle demo

# Then build the real thing with kubebuilder (steps/01): define the CRD,
# implement this exact Reconcile against a real apiserver, and watch
# `kubectl get edgeroute -o yaml` show the status conditions you set.

7. Exercises

Wire to gw-08: implement DataPlane over gw-08's SnapshotCache so the operator reconciles CRs into xDS and an Envoy serves them.
Watch the referenced Service: enqueue the EdgeRoute when its target Service's endpoints change (gw-09), so endpoint churn re-triggers reconcile — then your data plane always has current endpoints.
CRD versioning as migration: add a v2 spec field, write a conversion from v1, and keep both clients working — a migration at the API level (gw-12).
Leader election: run two controllers and add a simple lease so only one reconciles; prove no double-writes.
Real kubebuilder operator (steps/01): scaffold, define the CRD, port this Reconcile, deploy to kind, and confirm self-heal + finalizer cleanup on a live cluster.

gw-10 — References

Gateway API (named in the JD)

Kubernetes Gateway API — spec, the role model (GatewayClass/Gateway/HTTPRoute), conformance, and the migration guide from Ingress. https://gateway-api.sigs.k8s.io/
Envoy Gateway — Gateway API → Envoy; a reference implementation to read (it's an operator that emits xDS, tying gw-08+gw-10 together). https://gateway.envoyproxy.io/
Istio Gateway — both the Istio Gateway CRD and Istio's Gateway API support; istiod as the operator/control plane. https://istio.io/latest/docs/tasks/traffic-management/ingress/gateway-api/
Contour — another Gateway API control plane over Envoy.

Operator / controller framework

controller-runtime — the Go library: managers, controllers, caching informers, work queues, leader election. https://github.com/kubernetes-sigs/controller-runtime
kubebuilder — scaffolding + the operator book (the canonical tutorial). https://book.kubebuilder.io/
Operator SDK — alternative scaffolding (same controller-runtime underneath).
Programming Kubernetes (O'Reilly) — CRDs, informers, the client-go machinery under controller-runtime.

CRD machinery

Kubernetes docs — CustomResourceDefinitions, versioning + conversion webhooks, finalizers, owner references, the status subresource, admission webhooks. https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/
The "level-triggered vs edge-triggered" reconciliation writeups (Hausenblas/kubebuilder; James Bowes "Level Triggering and Reconciliation").

Tooling

kind / k3d / minikube — a local cluster to run the operator.
kubectl api-resources, kubectl explain, kubectl get <cr> -o yaml — inspect CRDs and status.
kubectl apply the Gateway API CRDs, then your operator.

Cross-lab dependencies

Upstream: gw-08 (the control plane the operator drives), gw-09 (API machinery, EndpointSlices), gw-03 (the route model HTTPRoute expresses).
Downstream: gw-12 (Ingress→Gateway-API migration; CRD versioning as migration tooling), gw-11 (status conditions as observability).

gw-10 — Analysis

The design review for the operator you build in steps/.

Required behaviors

Level-triggered convergence. Reconcile drives actual → desired from current state every time; missing an event or restarting the controller never leaves the world permanently wrong.
Idempotent reconcile. Reconciling the same state twice makes no additional change (a hash/version guard prevents spurious data-plane pushes — same idea as gw-08).
Status reporting. Every reconciled CR gets status.conditions (Accepted, Programmed) so users/dashboards know if it took effect.
Clean teardown. A finalizer deprograms the data plane before the CR is deleted; owner references cascade-delete created children.
HA-safe. Leader election ensures one active reconciler; no double-writes.

Design decisions

controller-runtime over raw client-go. It provides caching informers (watch once, serve many reads), a rate-limited work queue, and leader election out of the box. The lab focuses on the reconcile logic, not the plumbing.
Reconcile = compute desired data-plane config, then push. The operator watches Gateway/HTTPRoute (+ EndpointSlices), builds the same kind of snapshot as gw-08, and pushes it (to an in-process xDS cache in the lab; to Envoy in production). This makes gw-08 and gw-10 literally the same control plane with a Kubernetes front end.
Content-hash versioning. The pushed config version is a hash of the desired state, so a reconcile that computes the same config is a no-op — debouncing under the heavy event churn of a busy cluster.
Status as the contract. The lab writes Programmed=true only after the data-plane push succeeds, so status reflects reality, not intent. A failed push sets Programmed=false with a reason — the CRD analog of an xDS NACK.

Tradeoffs worth flagging

Reconcile granularity vs churn. Reconciling the whole gateway on any change is simple but can thrash under churn; reconciling per-route is finer but more complex. Debounce + content-hash versioning is the pragmatic middle (gw-04/gw-08 echo).
Caching staleness vs API load. Informers serve cached reads (fast, low API load) but can be momentarily stale; level-triggered reconcile tolerates this (it'll re-run). Don't bypass the cache with live reads except where you truly need strong consistency.
Finalizer deadlocks. A finalizer that can't complete (e.g. the data plane is unreachable) blocks deletion forever. Bound the cleanup, and have a break-glass to remove a stuck finalizer — a real operational hazard to call out.
CRD evolution. Changing the CRD schema is itself a migration: additive optional fields are safe; renames/removals need a new version + conversion webhook + deprecation window. Plan it like any migration (gw-12).
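To make the CRD-evolution tradeoff concrete, here is a sketch of a conversion between two versions of a spec. Everything in it is hypothetical: `SpecV1`, `SpecV2`, the renamed `PathPrefix` field, and the new `TimeoutSeconds` field are invented for illustration and are not part of the lab's EdgeRoute. In a real cluster this logic would live behind a conversion webhook.

```go
package main

import "fmt"

// SpecV1 is the hypothetical old version of the spec.
type SpecV1 struct {
	Hostname string
	Prefix   string
	Service  string
	Port     int32
}

// SpecV2 renames Prefix to PathPrefix and adds an optional field.
type SpecV2 struct {
	Hostname       string
	PathPrefix     string // renamed from Prefix
	Service        string
	Port           int32
	TimeoutSeconds int32 // new and optional: 0 means "use the default"
}

// V1ToV2 is lossless: every v1 field has a home in v2.
func V1ToV2(in SpecV1) SpecV2 {
	return SpecV2{
		Hostname:   in.Hostname,
		PathPrefix: in.Prefix,
		Service:    in.Service,
		Port:       in.Port,
	}
}

// V2ToV1 is lossy: TimeoutSeconds has no v1 field. A real conversion would
// have to preserve it somewhere (for example in an annotation) so that a
// v2 -> v1 -> v2 round trip does not silently drop it.
func V2ToV1(in SpecV2) SpecV1 {
	return SpecV1{
		Hostname: in.Hostname,
		Prefix:   in.PathPrefix,
		Service:  in.Service,
		Port:     in.Port,
	}
}

func main() {
	old := SpecV1{Hostname: "a.example", Prefix: "/api", Service: "svc-a", Port: 80}
	fmt.Println("v1 round trip intact:", V2ToV1(V1ToV2(old)) == old)

	cur := SpecV2{Hostname: "a.example", PathPrefix: "/api", Service: "svc-a", Port: 80, TimeoutSeconds: 30}
	fmt.Println("v2 round trip intact:", V1ToV2(V2ToV1(cur)) == cur)
}
```

The second check fails by design, which is the reason renames and new fields need a planned conversion and deprecation window rather than an in-place schema edit.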

What production adds beyond this lab

Full Gateway API conformance (all route types, ReferenceGrant, cross-namespace rules, policy attachment).
Admission/validation webhooks for richer validation than the OpenAPI schema allows, and defaulting webhooks.
Multi-tenancy + RBAC mapped to the Gateway API role split (infra/operator/app-dev).
Robust status + events + metrics (reconcile latency, queue depth, error rate, programmed-vs-desired drift) — gw-11.
Conversion webhooks and a tested upgrade path for CRD versions, plus leader election, graceful shutdown, and a tested HA failover.

gw-10 — Execution

Prerequisites

Go ≥ 1.25 (stdlib only, offline). Optional for the real operator: kubebuilder, kind/k3d, kubectl.

One-shot

cd gw-10-gateway-api-operators && bash scripts/verify.sh   # tests + lifecycle demo

Per-language workflow (Go)

cd gw-10-gateway-api-operators/src/go
go test -race -count=1 ./...        # self-heal, idempotency, status, finalizers, controller
go run ./cmd/operatordemo          # the reconcile lifecycle, step by step

Package map

File	What
`operator/store.go`	in-memory "apiserver": objects, generation, finalizers, deletionTimestamp
`operator/dataplane.go`	the programmed sink (content-hash versioned, idempotent, can fail)
`operator/reconcile.go`	level-triggered reconcile: self-heal, status, finalizer cleanup, reprogram-all
`operator/controller.go`	work-queue controller (Enqueue / ProcessAll / Run)
`cmd/operatordemo`	the lifecycle demonstration

Real operator

steps/01-crd-and-reconcile.md scaffolds the same Reconcile with kubebuilder against a real apiserver; steps/02 wires the data-plane program + finalizers. See GUIDE.md §5 for how this becomes Envoy Gateway with gw-08.

gw-10 — Verification

One command

cd gw-10-gateway-api-operators && bash scripts/verify.sh

What the tests prove

Test	Invariant
`TestLevelTriggeredSelfHealing`	status flips on its own when the underlying Service appears — no CR change needed (level-triggered)
`TestIdempotentReconcile`	reconciling identical state 5× pushes the data plane once (content-hash idempotency)
`TestStatusProgrammedOnlyAfterSuccess`	`Programmed=true` only after the push succeeds; a failed push → `Programmed=false/PushFailed` (the CR's xDS-NACK analog)
`TestFinalizerCleanup`	delete → Terminating (object kept) → reconcile deprograms + drops finalizer → object removed
`TestDeleteReprogramsRemaining`	deleting one route leaves the data plane with exactly the others
`TestControllerProcessesQueue`	the work-queue controller reconciles enqueued keys

All under -race.

Demo (operatordemo, in verify.sh)

Shows the full lifecycle: apply (service missing → Accepted=false), self-heal (service created → Programmed=true), idempotent reconciles (pushes stays 1), delete (Terminating), finalizer cleanup (removed + data plane reprogrammed to 0 routes).

What "green" does NOT guarantee

In-memory apiserver, not real Kubernetes. Production uses controller-runtime + a real apiserver (informers, rate-limited requeue, leader election) — the level-triggered core is identical (steps/01).
EdgeRoute is a stand-in for HTTPRoute. Full Gateway API conformance is out of scope (CONCEPTS §3).
DataPlane is MemDataPlane. Wiring it to gw-08 xDS → real Envoy is an exercise (GUIDE §7.1).

gw-10 step 01 — A CRD and a level-triggered reconcile loop

Goal

Define a CRD, write a controller-runtime reconciler that converges actual → desired state idempotently, and report status. This is the operator pattern in its smallest correct form.

Scaffold (kubebuilder)

mkdir gw10-operator && cd gw10-operator
kubebuilder init --domain 10xdev.io --repo github.com/10xdev/gw10
kubebuilder create api --group net --version v1 --kind EdgeRoute \
  --resource --controller

Code — the CRD type (`api/v1/edgeroute_types.go`)

package v1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// EdgeRouteSpec is the DESIRED state (what the user writes).
type EdgeRouteSpec struct {
	Hostname string `json:"hostname"`
	Prefix   string `json:"prefix"`
	Service  string `json:"service"` // target k8s Service name
	Port     int32  `json:"port"`
}

// EdgeRouteStatus is CONTROLLER-written outcome (the receipt).
type EdgeRouteStatus struct {
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}

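// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
type EdgeRoute struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   EdgeRouteSpec   `json:"spec,omitempty"`
	Status EdgeRouteStatus `json:"status,omitempty"`
}

// EdgeRouteList is the list type used by client List calls (step 02).
// kubebuilder scaffolds it next to EdgeRoute, together with an init() that
// registers both types with the scheme.
// +kubebuilder:object:root=true
type EdgeRouteList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`
	Items           []EdgeRoute `json:"items"`
}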

Code — the reconciler (`internal/controller/edgeroute_controller.go`)

package controller

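import (
	"context"

	netv1 "github.com/10xdev/gw10/api/v1"
	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
)

type EdgeRouteReconciler struct {
	client.Client
	Programmer DataPlane // pushes config to the proxy (step 02 / gw-08)
}

// Reconcile is LEVEL-TRIGGERED: it derives the right state from scratch
// every call. It must be idempotent and survive being called any number
// of times in any order.
func (r *EdgeRouteReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var er netv1.EdgeRoute
	if err := r.Get(ctx, req.NamespacedName, &er); err != nil {
		// NotFound = deleted. Level-triggered: nothing to do here; the
		// data plane is reconciled from the full set of EdgeRoutes.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// 1. Observe ACTUAL state: does the target Service exist?
	var svc corev1.Service
	svcKey := client.ObjectKey{Namespace: req.Namespace, Name: er.Spec.Service}
	if err := r.Get(ctx, svcKey, &svc); err != nil {
		if !apierrors.IsNotFound(err) {
			return ctrl.Result{}, err // transient API error: retry with backoff
		}
		// No requeue needed: the Service watch below re-triggers reconcile
		// when the Service appears.
		return ctrl.Result{}, r.setCondition(ctx, &er, "Accepted", metav1.ConditionFalse,
			"ServiceNotFound", "target Service does not exist")
	}

	// 2. Converge: program the data plane for THIS route (idempotent).
	if err := r.Programmer.Program(ctx, &er); err != nil {
		_ = r.setCondition(ctx, &er, "Programmed", metav1.ConditionFalse,
			"PushFailed", err.Error()) // the CRD analog of an xDS NACK
		return ctrl.Result{}, err // returning the error requeues with backoff
	}

	// 3. Report status: only Programmed=true AFTER the push succeeded.
	if err := r.setCondition(ctx, &er, "Accepted", metav1.ConditionTrue, "Valid", "route accepted"); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, r.setCondition(ctx, &er, "Programmed", metav1.ConditionTrue,
		"Synced", "data plane updated")
}

// setCondition writes one condition to the status subresource and returns
// the update error, so a failed write (for example a conflict) is retried.
func (r *EdgeRouteReconciler) setCondition(ctx context.Context, er *netv1.EdgeRoute,
	t string, s metav1.ConditionStatus, reason, msg string) error {
	meta.SetStatusCondition(&er.Status.Conditions, metav1.Condition{
		Type: t, Status: s, Reason: reason, Message: msg,
		ObservedGeneration: er.Generation,
	})
	return r.Status().Update(ctx, er)
}

// SetupWithManager wires the watches: reconcile on EdgeRoute changes AND on
// changes to the Services they reference. Owns() would fire only for Services
// that carry an owner reference to an EdgeRoute; these Services are merely
// referenced, so each Service event is mapped to the EdgeRoutes that name it.
// (Watches/MapFunc signatures as of controller-runtime v0.15+.)
func (r *EdgeRouteReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&netv1.EdgeRoute{}).
		Watches(&corev1.Service{}, handler.EnqueueRequestsFromMapFunc(r.routesForService)).
		Complete(r)
}

// routesForService returns a reconcile request for every EdgeRoute in the
// Service's namespace whose spec.service names that Service.
func (r *EdgeRouteReconciler) routesForService(ctx context.Context, svc client.Object) []ctrl.Request {
	var list netv1.EdgeRouteList
	if err := r.List(ctx, &list, client.InNamespace(svc.GetNamespace())); err != nil {
		return nil
	}
	var reqs []ctrl.Request
	for i := range list.Items {
		if list.Items[i].Spec.Service == svc.GetName() {
			reqs = append(reqs, ctrl.Request{NamespacedName: client.ObjectKeyFromObject(&list.Items[i])})
		}
	}
	return reqs
}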

Tasks

Scaffold, define the CRD, install it (make install), run the controller (make run).
kubectl apply an EdgeRoute pointing at a missing Service; confirm status.conditions shows Accepted=False / ServiceNotFound. Create the Service; confirm it flips to Accepted=True / Programmed=True without you touching the EdgeRoute — that's level-triggered self-healing.
Kill and restart the controller; confirm it reconciles all existing EdgeRoutes back to correct status on startup (no events needed).

Acceptance

A working CRD + reconciler that reports accurate status conditions.
Demonstrated self-healing: fixing the underlying Service flips status with no change to the CR.
Controller restart reconverges from scratch (proves level-triggered).

Discussion prompts

Why must Reconcile be idempotent, and what breaks if it isn't?
Why report status after the data-plane push, not before? (Status must reflect reality — the xDS ACK/NACK analogy.)
Why does watching the referenced Service matter? (Endpoint churn in gw-09 must re-trigger reconcile so the data plane stays current.)

gw-10 step 02 — Program the data plane, and clean up with finalizers

Goal

Make the operator a real control plane: reconcile the full set of routes into a data-plane config (the gw-08 xDS snapshot), and use a finalizer to deprogram cleanly on delete so no orphaned config or traffic-to-a-deleted-backend remains.

Code — the DataPlane programmer (bridges to gw-08)

package controller

import (
	"context"
	"sort"

	netv1 "github.com/10xdev/gw10/api/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type DataPlane interface {
	Program(ctx context.Context, er *netv1.EdgeRoute) error
}

// xdsProgrammer rebuilds the WHOLE snapshot from all EdgeRoutes and
// pushes it (gw-08 SnapshotCache). Rebuilding the full set — rather than
// mutating incrementally — keeps reconcile idempotent and correct.
type xdsProgrammer struct {
	client.Client
	cache SnapshotSink // your gw-08 cache.SetSnapshot wrapper
}

func (p *xdsProgrammer) Program(ctx context.Context, _ *netv1.EdgeRoute) error {
	var list netv1.EdgeRouteList
	if err := p.List(ctx, &list); err != nil {
		return err
	}
	routes := make([]netv1.EdgeRoute, 0, len(list.Items))
	for _, er := range list.Items {
		if er.DeletionTimestamp.IsZero() { // exclude routes being deleted
			routes = append(routes, er)
		}
	}
	// Deterministic order -> stable version -> no spurious pushes (gw-08).
	sort.Slice(routes, func(i, j int) bool { return routes[i].Name < routes[j].Name })

	snap := buildSnapshotFromRoutes(routes) // -> listeners/routes/clusters/EDS
	return p.cache.Set(ctx, "edge-envoy", snap)
}

Code — finalizer for clean teardown

// controllerutil is "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil".
const finalizer = "net.10xdev.io/deprogram"

func (r *EdgeRouteReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var er netv1.EdgeRoute
	if err := r.Get(ctx, req.NamespacedName, &er); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Handle deletion: run cleanup, THEN drop the finalizer to let it go.
	if !er.DeletionTimestamp.IsZero() {
		if controllerutil.ContainsFinalizer(&er, finalizer) {
			// Deprogram: rebuild the snapshot WITHOUT this route so the
			// data plane stops sending it traffic before the object vanishes.
			if err := r.Programmer.Program(ctx, &er); err != nil {
				return ctrl.Result{}, err // retry cleanup with backoff
			}
			controllerutil.RemoveFinalizer(&er, finalizer)
			// If this update fails the object stays Terminating, so return
			// the error and let the work queue retry.
			if err := r.Update(ctx, &er); err != nil {
				return ctrl.Result{}, err
			}
		}
		return ctrl.Result{}, nil
	}

	// Ensure the finalizer is present on live objects BEFORE programming,
	// so a route is never live without its cleanup hook.
	if !controllerutil.ContainsFinalizer(&er, finalizer) {
		controllerutil.AddFinalizer(&er, finalizer)
		if err := r.Update(ctx, &er); err != nil {
			return ctrl.Result{}, err
		}
	}

	// ... normal reconcile (program + status) from step 01 ...
	return ctrl.Result{}, nil
}

Tasks

Implement xdsProgrammer reusing your gw-08 SnapshotCache. Apply two EdgeRoutes; confirm Envoy serves both (config came from CRDs → operator → xDS → Envoy: the full stack).
Add the finalizer. kubectl delete an EdgeRoute; confirm the object stays in Terminating until the operator deprograms it, then disappears — and Envoy stops routing that prefix before the object is gone (no orphaned config).
Break-glass: simulate the data plane being unreachable during delete; show the object is stuck in Terminating (finalizer can't complete), then remove the finalizer by hand and explain why that's the emergency escape hatch.

Acceptance

End-to-end: kubectl apply EdgeRoute → Envoy serves the route; kubectl delete → route deprogrammed before deletion completes.
Finalizer prevents orphaned data-plane config; you can articulate the stuck-finalizer hazard and its break-glass.

Discussion prompts

Why rebuild the entire snapshot from all EdgeRoutes each reconcile instead of incrementally adding/removing one? (Idempotency, correctness after missed events, matches gw-08's content-hash versioning.)
Why deprogram before the object is deleted (finalizer) rather than after? (Avoid a window where the object is gone but traffic still flows to a dead backend.)
This operator + gw-08 is exactly Envoy Gateway / Contour in miniature. What does a production version add (Gateway API conformance, webhooks, multi-tenancy, leader election, metrics)?

gw-11 — Observability for the Data Plane

The JD lists "uplift the … observability … posture for traffic traversing the cloud" as a primary responsibility, and twice demands the ability to "identify root causes using data." A gateway sits on the critical path for billions of requests; it is both the place where problems are seen first and the place best positioned to attribute them. Observability is not dashboards-after-the-fact — it's the designed-in ability to ask new questions of a running system and get answers fast enough to act.

This lab covers the three pillars (metrics, logs, traces) as they apply to a proxy, the RED and USE methods, the discipline of golden signals and SLOs/error budgets, distributed tracing across a proxy (context propagation is the subtle part), tail-latency measurement, and eBPF-based visibility. You will propagate W3C trace context through a proxy and build RED metrics you can alert on.

1. What is it?

Observability is the property that lets you understand a system's internal state from its outputs. Three pillars, each answering a different question:

Pillar	Answers	At a gateway
Metrics	"Is it healthy? How much / how fast / how bad?" (aggregates)	RED per route/cluster; cheap, always-on, alertable
Logs	"What exactly happened to this request?" (events)	structured access logs; high cardinality, sampled
Traces	"Where did the time/error go across services?" (causality)	one span per hop, stitched by propagated context

The methods that turn raw signals into judgment:

RED (for request-driven services like a gateway): Rate, Errors, Duration — per route, per cluster, per status class.
USE (for resources): Utilization, Saturation, Errors — CPU, memory, fds, connection pools, event-loop lag.
Golden signals (Google SRE): latency, traffic, errors, saturation — the four to put on every dashboard and page on.

And the discipline that makes them actionable:

SLI/SLO/error budget: pick a Service Level Indicator (e.g. p99 latency, success rate), set a target (SLO), and the allowed failure (error budget) governs how aggressively you ship vs stabilize (gw-12).

2. Why does it matter?

The JD makes it a core deliverable and a core skill. "Uplift the observability posture" is the build side; "root cause using data" is the operate side. Every lab's debugging section in this phase assumes the signals this lab defines.
A gateway is the system's best vantage point. It sees every request, every route, every backend, every status. RED metrics at the gateway attribute a problem to a route/cluster faster than any single service can — which is why edge observability is disproportionately valuable.
Tail latency is invisible without the right metrics. Averages lie; p99/p99.9 is what users feel and where retries/hedging/outliers (gw-06) show up. You must measure distributions (histograms), not means — and know why you can't average percentiles across instances.
Tracing across a proxy is where context-propagation bugs live. If the gateway doesn't propagate (and create) trace context correctly, every downstream trace is broken — you lose the ability to see where latency goes. This is a specific, common, high-impact gateway responsibility.

3. How does it work?

RED metrics with the right instruments

rate     = counter: requests_total{route,cluster,method}        (rate() in queries)
errors   = counter: requests_total{...,status_class="5xx"}      (errors/rate ratio)
duration = HISTOGRAM: request_duration_seconds_bucket{route,...} (p50/p99 via quantiles)

Duration must be a histogram (bucketed), not a gauge or summary, so you can compute percentiles and aggregate across instances correctly. A counter for retries (retries_total / requests_total — the gw-06 amplification signal) and gauges for connections, pool utilization, and event-loop lag round it out.

Why you can't average percentiles

p99 of instance A and p99 of instance B do not average to the fleet p99 — percentiles aren't linear. The correct approach: export histogram buckets per instance, sum the buckets fleet-wide, then compute the percentile from the merged histogram. Getting this wrong is a classic "our dashboards say p99 is fine but users are unhappy" bug.

Distributed tracing across the proxy

A trace is a tree of spans (one unit of work each), stitched by a propagated context. The gateway must:

Extract incoming context from headers (W3C traceparent, or b3 for Zipkin lineage).
Create its own span for the proxy hop (so gateway latency is visible as its own segment).
Inject the (possibly new/child) context into the request it forwards to the origin — so the origin's spans join the same trace.

traceparent: 00-<32-hex trace-id>-<16-hex span-id>-01
             │   │                   │               └ flags (01 = sampled)
             │   │                   └ this hop's span id
             │   └ the trace id (same across the whole request tree)
             └ version

If the gateway drops or fails to forward traceparent, the origin starts a new trace and the chain is severed — you can no longer see "the gateway added 3ms, the origin added 200ms." Propagation is the whole game.

Sampling

Tracing every request is expensive; sampling keeps cost bounded. Two models: head-based (decide at the start, propagate the decision in the flags bit — simple, may miss rare errors) and tail-based (buffer spans, decide after seeing the whole trace — can keep all errors/slow traces, costs a collector that buffers). The gateway typically honors the incoming sampling decision and can force-sample errors.

Access logs: structured and sampled

Per-request structured logs (JSON) with route, status, duration, bytes, upstream, retry count, trace id. High cardinality and volume → sample the successes, keep the errors. The trace id in the log line is what links logs ↔ traces ↔ metrics ("exemplars").
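A stdlib-only sketch of that policy, using log/slog for the JSON line. The accessRecord type, the field names, and the 1-in-n rule keyed on a hash of the trace id are assumptions for illustration; "error" here means a 5xx status.

```go
package main

import (
	"hash/fnv"
	"log/slog"
	"os"
	"time"
)

// accessRecord is one finished request as the access log sees it.
type accessRecord struct {
	Route, Upstream, TraceID string
	Status, Retries          int
	Bytes                    int64
	Duration                 time.Duration
}

// shouldLog keeps every 5xx and a deterministic 1-in-n slice of the rest.
// Hashing the trace id makes the decision repeatable for a given trace.
func shouldLog(r accessRecord, n uint32) bool {
	if r.Status >= 500 || n <= 1 {
		return true
	}
	h := fnv.New32a()
	h.Write([]byte(r.TraceID))
	return h.Sum32()%n == 0
}

// logAccess emits one structured line; trace_id is the pivot into traces.
func logAccess(l *slog.Logger, r accessRecord) {
	l.Info("access",
		"route", r.Route, "status", r.Status,
		"duration_ms", r.Duration.Milliseconds(), "bytes", r.Bytes,
		"upstream", r.Upstream, "retries", r.Retries,
		"trace_id", r.TraceID)
}

func main() {
	l := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	records := []accessRecord{
		{Route: "/api/play", Upstream: "origin-a", Status: 200, Bytes: 512,
			Duration: 4 * time.Millisecond, TraceID: "4bf92f3577b34da6a3ce929d0e0e4736"},
		{Route: "/api/play", Upstream: "origin-a", Status: 503, Retries: 2,
			Duration: 900 * time.Millisecond, TraceID: "0af7651916cd43dd8448eb211c80319c"},
	}
	for _, r := range records {
		if shouldLog(r, 100) {
			logAccess(l, r)
		}
	}
}
```

Sampling successes this way keeps log volume proportional to 1/n while guaranteeing that every failed request still has a log line to pivot from.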

eBPF and the data plane

eBPF (Cilium/Hubble, Pixie) observes the kernel network path without app changes: per-connection bytes, retransmits, DNS, even L7 with parsers. It makes the gw-01/gw-09 layer visible, and it is the right tool for "is it the app or the network?" questions that app-level metrics cannot answer.

SLOs and error budgets (the operating discipline)

SLO: 99.9% of requests succeed over 28 days
error budget = 0.1% of requests may fail = your "failure currency"
  budget healthy  -> ship features / risky migrations (gw-12)
  budget burning  -> freeze changes, stabilize
burn-rate alerts: page when you'll exhaust the budget fast, not on every blip

4. Core terminology

Term	Definition
Three pillars	Metrics (aggregates), logs (events), traces (causality).
RED	Rate, Errors, Duration — for request-driven services.
USE	Utilization, Saturation, Errors — for resources.
Golden signals	Latency, traffic, errors, saturation.
Histogram	Bucketed latency metric enabling correct percentiles + aggregation.
p99 / tail latency	High percentiles; what users feel; can't be averaged across instances.
Span / trace	A unit of work / a tree of spans for one request across services.
Context propagation	Carrying trace context across hops (W3C `traceparent`, `b3`).
Sampling (head/tail)	Deciding which traces to keep, up-front vs after the fact.
Exemplar	A trace id attached to a metric/log so you can pivot between them.
SLI / SLO / error budget	Indicator / target / allowed failure that governs risk.
Burn rate	How fast you're consuming the error budget; the basis for good alerts.
Cardinality	The number of distinct label combinations; the cost driver of metrics.
eBPF	In-kernel programs for app-transparent network/L7 observability.

5. Mental models

Metrics are the dashboard gauges; traces are the dashcam; logs are the black box. Gauges tell you something's wrong and roughly where; the dashcam (trace) shows the moment across the whole journey; the black box (logs) has the per-event detail for forensics. You need all three, and the ability to jump between them (exemplars).
A percentile is a property of a distribution, not a number you can add. You can sum counts in buckets and recompute, but you cannot average p99s. Treat latency as a shape (histogram), not a value.
Trace context is a baton in a relay. Each runner (service) must take the baton (extract), run their leg (their span), and pass it on (inject). Drop the baton at the gateway and the rest of the race is untimed — you'll never know which leg was slow.
The error budget is a spending account. Reliability you don't need is money left on the table; reliability you've overspent (budget burned) means stop shipping and fix. It converts "how reliable should we be?" from an argument into arithmetic — and it's what licenses a risky migration (gw-12) when the budget is healthy.
Cardinality is the credit-card bill of observability. Every new label dimension (especially unbounded ones like user-id or full URL) multiplies series and cost. Label by bounded dimensions (route, status class, cluster); put the high-cardinality stuff in traces/logs.
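The "multiplies series" claim in the last model is plain arithmetic. The sketch below assumes a Prometheus-style histogram, which exposes one series per finite bucket plus the +Inf bucket, _sum, and _count for each label combination; the route, cluster, and user counts are made-up inputs, and the product is a worst case in which every combination occurs.

```go
package main

import "fmt"

// seriesPerHistogram is what one label combination costs: each finite
// bucket, the +Inf bucket, _sum and _count.
func seriesPerHistogram(finiteBuckets int64) int64 { return finiteBuckets + 3 }

// totalSeries multiplies the distinct values of every label.
func totalSeries(perCombination int64, labelValues ...int64) int64 {
	n := perCombination
	for _, v := range labelValues {
		n *= v
	}
	return n
}

func main() {
	per := seriesPerHistogram(10)
	fmt.Println(totalSeries(per, 50, 20))            // 50 routes x 20 clusters
	fmt.Println(totalSeries(per, 50, 20, 1_000_000)) // the same, plus a user-id label
}
```

With these inputs the bounded labels cost 13,000 series, and adding a single million-value label turns that into 13 billion.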

6. Common misconceptions

"Average latency is fine." Averages hide the tail: if just over 1% of requests take 2s and the rest take 5ms, the mean is only about 25ms while the p99 is 2s, and at 1M rps that 1% is 10k slow requests every second. Always measure and alert on percentiles via histograms.
"Just trace everything." Tracing every request at high rps is expensive and noisy; sample (head- or tail-based), but force-keep errors and slow traces. The skill is sampling and still catching the rare bad trace.
"The gateway doesn't need to do anything for tracing." It must extract, create its own span, and inject context downstream. Silent failure to propagate breaks every downstream trace — a high-impact bug unique to proxies.
"More dashboards = more observability." Observability is the ability to answer new questions, not to pre-build every chart. RED + golden signals + exemplars + good labels beat 50 bespoke panels.
"Alert on every error." That trains people to ignore alerts. Alert on symptoms (SLO burn rate, user-facing latency/errors), page rarely, and let dashboards/traces handle diagnosis. Cause-based alerts are for runbooks, not pagers.

7. Interview talking points

"p99 latency just doubled: walk me through it." The flagship ops-round answer (gw-00 INTERVIEW.md): scope (region/route/cluster) → golden signals (which moved first) → correlate with deploys/config/origin events → walk the request path (TLS/filters/LB/pool/origin) → falsifiable hypothesis → confirm with one metric → mitigate then fix. Name the metrics at each step.
"How do you measure latency correctly across a fleet?" Histograms per instance, summed bucket-wise, percentiles computed from the merged histogram, because percentiles don't average. This single answer separates people who've operated services from those who haven't.
"What does the gateway do for distributed tracing?" Extract incoming traceparent/b3, create a span for the proxy hop, inject context downstream so origin spans join the trace; honor (and force-on-error) the sampling decision; attach the trace id to access logs as an exemplar. Dropping propagation breaks every downstream trace.
"What's RED vs USE and which for a gateway?" RED (Rate/Errors/Duration) for the request flow; USE (Utilization/Saturation/Errors) for resources (CPU, fds, pools, event-loop lag). A gateway needs both: RED for user impact, USE to find the saturated resource behind it.
"How do SLOs change how you work?" They convert reliability into a budget: healthy budget → ship/migrate aggressively (gw-12); burning budget → freeze and stabilize. Alert on burn rate (symptom), not on every error (cause), so pages are rare and meaningful.
"Your dashboards look healthy but users complain — why?" Averaged percentiles, too-coarse labels hiding a bad route, success-only sampling hiding the error traces, or the metric measured at the wrong point (server-side latency excludes queueing/network the user experiences). Know these failure modes of observability itself.

8. Connections to other labs

gw-01 / gw-09 (L4 / K8s) — accept-queue depth, conntrack, retrans, event-loop lag are the saturation signals; eBPF makes them visible.
gw-03 (API gateway) — the outbound filter is where access logs, metrics, and span completion happen; per-filter timings are RED's detail.
gw-04 (connection management) — connections.created.rate is the churn metric; pool utilization and acquisition wait are saturation.
gw-06 (resilience) — retries/requests, circuit state, ejection events, and the adaptive-concurrency limit vs in-flight are the signals you alert on; observability is what makes resilience tunable.
gw-08 / gw-10 (control plane) — applied config version, NACK rate, reconcile latency, and status conditions are the control-plane's observability.
gw-12 (migration) — shadow/canary comparisons, SLO-gated rollout, and automated rollback all run on the signals defined here.

gw-11 — The Hitchhiker's Guide to Data-Plane Observability

Companion to CONCEPTS.md, with the runnable primitives in src/go/obs/. The JD demands "identify root causes using data" twice — this lab builds the data, correctly.

A gateway sits on the critical path for billions of requests; it sees problems first and is best placed to attribute them. But observability is full of subtle ways to lie to yourself. This lab builds three primitives that get it right: a mergeable histogram (so fleet percentiles are correct), W3C trace propagation (so traces don't break at the proxy), and RED metrics (the signals you alert on).

Run bash scripts/verify.sh:

percentiles can't be averaged across instances:
  instance A p99 = 0.001s
  instance B p99 = 0.100s
  AVG of p99s    = 0.051s   <- WRONG
  MERGED p99     = 0.100s   <- RIGHT (sum buckets, then quantile)
trace propagation through the gateway:
  incoming traceparent: 00-aaaa...-bbbb...-01
  forwarded downstream: 00-aaaa...-cccc...-01
  same trace-id? true   new span-id? true

1. You cannot average percentiles (histogram.go)

This is the single most important — and most violated — rule of latency monitoring. A percentile is a property of a distribution, not a number you can add. TestPercentilesMergeNotAverage makes it undeniable: two instances, one all-1ms and one all-100ms. Averaging their p99s gives 51ms — a number that describes neither instance and isn't the fleet's p99. The fleet p99 is 100ms (half the fleet is slow), and the only correct way to get it is to merge the histogram buckets, then compute the quantile (merged := a.Clone(); merged.Merge(b); merged.Quantile(0.99)).

That's why latency must be a histogram (bucketed counts), not a gauge or a pre-computed summary: histograms with shared bounds are additive, so you sum per-instance buckets fleet-wide and compute the percentile from the merged distribution. TestHistogramQuantile pins the bucket-based quantile (p99 of "99% fast, 1% slow" is the fast bucket; p99.9 is the slow tail). Get this wrong and your dashboards say "p99 is fine" while users suffer: a classic failure mode of observability itself.

In PromQL this is histogram_quantile(0.99, sum by (le) (rate(..._bucket[5m]))) — sum the buckets, then quantile. The lab's Merge is exactly that sum by (le).

2. Trace propagation: don't sever the trace (trace.go)

A trace is a tree of spans stitched by a propagated context. A proxy has a unique, high-impact responsibility: extract the incoming traceparent, create its own span for the proxy hop (same trace-id, new span-id), and inject it into the request it forwards — so the origin's spans join the same trace.

ParseTraceparent parses the W3C format (00-<32hex trace>-<16hex span>-<2hex flags>) and rejects malformed or all-zero ids (TestParseTraceparent). NewChild is the propagation rule: keep the trace-id and sampling decision, mint a new span-id (TestTracePropagationKeepsTrace). If the gateway drops the traceparent (or — worse — starts a fresh trace), the origin begins a disconnected trace and you lose the ability to see "the gateway added 3ms, the origin added 200ms." That severed-trace bug is unique to proxies and is exactly what the demo's same trace-id? true line guards against.

Extract/Inject work against any header carrier (http.Header.Get/Set), and TestExtractMissingIsNotOK ensures an absent header isn't silently treated as a valid context.

3. RED metrics (red.go)

RED tracks Rate, Errors, Duration per route — the request-driven service's golden signals. TestRED records 100 OK + 5 5xx and checks Rate=105, ErrorRatio≈0.0476, P99=5ms. Two design choices matter:

Bounded labels only (route, status class). Never label by trace-id, user-id, or full path — that's a cardinality explosion that bankrupts your metrics store. High-cardinality detail goes in traces/logs, linked by an exemplar (the trace-id on a log line / metric sample).
The retry ratio is a first-class signal. RetryRatio = retries/requests is the early warning for a retry storm (gw-06): when it climbs toward your budget, you're amplifying. Alert on it.
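A stdlib sketch of the counting side of RED, covering the two design choices above. It is not the lab's red.go: the RED type and its methods are hypothetical, only the 5xx class counts as an error, and duration is deliberately left out because it belongs in a histogram.

```go
package main

import (
	"fmt"
	"sync"
)

type routeCounters struct{ requests, errors, retries uint64 }

// RED keeps request, error, and retry counts per route.
type RED struct {
	mu      sync.Mutex
	byRoute map[string]*routeCounters
}

func NewRED() *RED { return &RED{byRoute: map[string]*routeCounters{}} }

// statusClass collapses a status code to a bounded label value.
func statusClass(code int) string { return fmt.Sprintf("%dxx", code/100) }

// Record counts one finished request and the retries it needed.
func (r *RED) Record(route string, status, retries int) {
	r.mu.Lock()
	defer r.mu.Unlock()
	c := r.byRoute[route]
	if c == nil {
		c = &routeCounters{}
		r.byRoute[route] = c
	}
	c.requests++
	if retries > 0 {
		c.retries += uint64(retries)
	}
	if statusClass(status) == "5xx" {
		c.errors++
	}
}

// ratio divides one counter by the request count for a route.
func (r *RED) ratio(route string, pick func(*routeCounters) uint64) float64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	c := r.byRoute[route]
	if c == nil || c.requests == 0 {
		return 0
	}
	return float64(pick(c)) / float64(c.requests)
}

func (r *RED) ErrorRatio(route string) float64 {
	return r.ratio(route, func(c *routeCounters) uint64 { return c.errors })
}

// RetryRatio is retries/requests, the retry-storm early warning.
func (r *RED) RetryRatio(route string) float64 {
	return r.ratio(route, func(c *routeCounters) uint64 { return c.retries })
}

func main() {
	m := NewRED()
	for i := 0; i < 100; i++ {
		m.Record("/api/play", 200, 0)
	}
	for i := 0; i < 5; i++ {
		m.Record("/api/play", 503, 2)
	}
	fmt.Printf("error ratio %.4f, retry ratio %.4f\n",
		m.ErrorRatio("/api/play"), m.RetryRatio("/api/play"))
}
```

Keying the map by route only, and deriving the status class from the code, is what keeps the label space bounded no matter what clients send.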

4. SLOs and alerting

The CONCEPTS file covers SLOs, error budgets, and burn-rate alerting in depth. The one-line rules:

Alert on symptoms, not causes. Page on SLO burn rate (user-facing latency/errors), not on every 5xx. Cause metrics (pool exhaustion, circuit state, NACKs) live on dashboards for diagnosis, not on the pager.
The "p99 doubled" drill (gw-00 INTERVIEW.md): scope → golden signals → correlate with deploys/config/origin → walk the path → falsifiable hypothesis → confirm with one metric → mitigate then fix. Every primitive here feeds a step of that drill.

5. Hands-on

cd src/go
bash ../../scripts/verify.sh     # tests + the two demos

# Wire RED + trace propagation into the gw-03 gateway's outbound filter
# and endpoint, then drive load and watch a stitched trace + a correct
# fleet p99 (exercise §6.1).

6. Exercises

Instrument gw-03: add an outbound RED filter and trace propagation in the proxy endpoint; expose /metrics; confirm a request produces a single trace spanning gateway+origin and a correctly-merged p99.
Burn-rate alert: implement an SLO (e.g. 99.9% success) and a multi-window burn-rate alert; show it pages on fast budget burn but not on a single blip.
Tail-based sampling: buffer spans for a trace and keep it only if it errored or exceeded a latency threshold; show you catch the rare bad trace while sampling the rest.
gRPC status from trailers (gw-02): make Record read grpc-status from trailers so a 200 + grpc-status 14 is counted as an error — a gateway that only reads :status undercounts failures.
Coordinated omission: drive load with a fixed-rate generator (wrk2) vs a closed-loop one (wrk) and show how the latter hides the tail — then explain why you measure at admission, not just at response write.

gw-11 — References

Methods & discipline

Google SRE Book — "Monitoring Distributed Systems" (golden signals), "Service Level Objectives," and the SRE Workbook's alerting-on-SLOs / burn-rate chapters. https://sre.google/sre-book/monitoring-distributed-systems/
Tom Wilkie — The RED Method; Brendan Gregg — The USE Method. https://www.brendangregg.com/usemethod.html
Observability Engineering (Majors/Fong-Jones/Miranda): high-cardinality, events-not-just-metrics, debugging unknown-unknowns.

Standards & tooling

W3C Trace Context — traceparent/tracestate; the propagation format the gateway must extract/inject. https://www.w3.org/TR/trace-context/
OpenTelemetry — traces/metrics/logs SDKs + the Collector; the vendor-neutral standard. https://opentelemetry.io/docs/
Prometheus — histograms, histogram_quantile, rate(), exemplars, recording rules; and why summaries can't aggregate. https://prometheus.io/docs/practices/histograms/
Zipkin b3 propagation — the older multi-header format still common in JVM/Netflix-lineage stacks.
Cilium Hubble / Pixie — eBPF-based L3–L7 observability without app changes. https://github.com/cilium/hubble

Envoy / proxy observability

Envoy access logs, stats, and tracing config — a production reference for exactly the signals this lab defines. https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/observability/statistics

Background

Gil Tene — How NOT to Measure Latency (coordinated omission; why your load test lies about the tail). A must-watch.
Marc Brooker — tail latency and percentiles posts.

Cross-lab dependencies

Upstream: every lab — this defines the signals their debugging sections use.
Downstream: gw-12 (shadow/canary diffing and SLO-gated rollout run on these signals).

gw-11 — Analysis

The design review for the instrumentation you add in steps/.

Required behaviors

RED per route/cluster. Rate, errors (by status class), and a latency histogram — not a mean — labeled by bounded dimensions.
Correct percentiles. Percentiles computed from histogram buckets, aggregatable across instances (never average p99s).
Trace context across the proxy. Extract incoming context, create a proxy span, inject downstream; never sever the trace.
Exemplars. Access logs carry the trace id so logs↔traces↔metrics link.
Symptom-based alerting. Alert on SLO burn rate, not on every error; pages are rare and user-facing.

Design decisions

Histogram for duration, counters for rate/errors/retries. Latency is a distribution; bucket it (with buckets chosen around the SLO, e.g. 1/5/10/25/50/100/250/500/1000ms). Errors are a counter labeled by status class so errors/rate is a ratio query. Retries get their own counter (the gw-06 amplification signal).
Bounded labels only. Label by route, method, status_class, cluster — never by user-id, full path, or trace-id (unbounded → cardinality explosion). High-cardinality detail lives in traces/logs, linked by exemplar.
OpenTelemetry for propagation. Use the OTel propagator to extract/inject W3C trace context, so the gateway interoperates with any backend. The span for the proxy hop makes gateway-added latency visible separately from origin latency.
Sampling: honor + force-error. Respect the incoming sampling decision (the flags bit) for cost control, but force-sample on error or high latency so the rare bad trace is never lost.

Tradeoffs worth flagging

Cost vs fidelity. More labels/buckets and more traces = more insight and more cost (storage, cardinality, collector load). Tune to the questions you actually need to answer; RED + golden signals + exemplars get you most of the way cheaply.
Head- vs tail-based sampling. Head-based is cheap and simple but may miss rare errors; tail-based keeps all errors/slow traces but needs a buffering collector and more resources. Many start head-based with error-forcing, then add tail-based where it pays.
Measurement point matters. Server-side latency excludes the queueing/network the user experiences (coordinated omission). Be explicit about where you measure (at admission? at response write?) and complement with client-side RUM where possible.
Observability of the observer. The instrumentation itself adds latency/CPU/allocations on the hot path (1M+ rps). Keep it cheap (lock-free counters, pre-allocated label sets); measure its overhead.
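One way to keep the hot path cheap is to make Observe a binary search plus one atomic add. The sketch below is an assumption about how that could look with sync/atomic, not the lab's implementation; the hammer helper exists only to exercise it from several goroutines.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"sync/atomic"
)

// atomicHistogram records observations without taking a lock: bounds are
// immutable after construction and each bucket is an atomic counter.
type atomicHistogram struct {
	bounds []float64
	counts []atomic.Uint64
}

func newAtomicHistogram(bounds []float64) *atomicHistogram {
	return &atomicHistogram{bounds: bounds, counts: make([]atomic.Uint64, len(bounds)+1)}
}

// Observe does one binary search and one atomic add.
func (h *atomicHistogram) Observe(v float64) {
	h.counts[sort.SearchFloat64s(h.bounds, v)].Add(1)
}

// Snapshot copies the counters for export. Buckets are read one at a time,
// so the copy is not a single consistent instant.
func (h *atomicHistogram) Snapshot() []uint64 {
	out := make([]uint64, len(h.counts))
	for i := range h.counts {
		out[i] = h.counts[i].Load()
	}
	return out
}

// hammer observes v from several goroutines at once and returns the total
// number of observations recorded.
func hammer(h *atomicHistogram, goroutines, perGoroutine int, v float64) uint64 {
	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perGoroutine; i++ {
				h.Observe(v)
			}
		}()
	}
	wg.Wait()
	var total uint64
	for _, c := range h.Snapshot() {
		total += c
	}
	return total
}

func main() {
	h := newAtomicHistogram([]float64{0.001, 0.01, 0.1})
	fmt.Println("observations recorded:", hammer(h, 8, 1000, 0.005))
}
```

Under heavy contention a single shared counter can still become a cache-line hotspot, so measuring the overhead, as the tradeoff above says, remains part of the job.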

What production adds beyond this lab

A full pipeline: OTel Collector → metrics/trace/log backends, with recording rules and SLO/burn-rate alerts.
Tail-based sampling, exemplars wired end-to-end, and RUM for true user-experienced latency.
eBPF (Hubble/Pixie) for app-transparent network + L7 visibility to answer "app or network?" (gw-01/gw-09).
Control-plane observability: applied config version per node, NACK rate, reconcile latency, status-condition dashboards (gw-08/gw-10).
Automated anomaly detection and the shadow/canary diffing harness that gw-12's migrations depend on.

gw-11 — Execution

Prerequisites

Go ≥ 1.25 (stdlib only, offline).

One-shot

cd gw-11-data-plane-observability && bash scripts/verify.sh   # tests + demos

Per-language workflow (Go)

cd gw-11-data-plane-observability/src/go
go test -race -count=1 ./...        # histogram/quantile/merge, traceparent, RED
go run ./cmd/obsdemo               # percentile-merge + trace-propagation demos

Package map

File	What
`obs/histogram.go`	bucketed latency histogram; bucket-based quantiles; bucket-wise Merge (correct fleet percentiles)
`obs/trace.go`	W3C `traceparent` parse/format; extract → new-span child → inject (don't sever the trace)
`obs/red.go`	RED metrics per route: rate, error ratio, p99, retry ratio (the gw-06 amplification signal)
`cmd/obsdemo`	the "can't average p99" + trace-propagation demonstrations

See GUIDE.md for the deep dive and the "p99 doubled" debugging drill.

gw-11 — Verification

One command

cd gw-11-data-plane-observability && bash scripts/verify.sh

What the tests prove

Test	Invariant
`TestHistogramQuantile`	bucket-based percentiles are correct (p99 = fast bucket, p99.9 = slow tail)
`TestPercentilesMergeNotAverage`	averaging per-instance p99s is wrong; merging buckets then quantile-ing is right (51ms vs 100ms)
`TestParseTraceparent`	valid W3C traceparent parses; malformed/all-zero/short/bad-version are rejected
`TestTracePropagationKeepsTrace`	the proxy child keeps the trace-id + sampling decision and mints a new span-id; inject/extract round-trips
`TestExtractMissingIsNotOK`	an absent traceparent is not silently treated as valid
`TestRED`	per-route rate, error ratio, p99, and retry ratio are computed correctly

All under -race.

Demo (obsdemo, in verify.sh)

avg-of-p99 (0.051s, WRONG) vs merged p99 (0.100s, RIGHT),
a traceparent forwarded with the same trace-id and a new span-id.

What "green" does NOT guarantee

Not a full pipeline. No OTel Collector / Prometheus backend; this is the in-process primitives. Wiring into gw-03 is an exercise (GUIDE §6.1).
No tail-based sampling / burn-rate alerts in the core package (exercises §6.2/§6.3).
HTTP status only. Reading gRPC status from trailers (gw-02) so a 200 + grpc-status 14 counts as an error is an exercise (§6.4).

gw-11 step 01 — RED metrics and trace-context propagation

Goal

Instrument the gw-03 gateway with RED metrics (a latency histogram, not a mean) and propagate W3C trace context across the proxy so downstream traces stitch together. Then practice the "p99 doubled" debugging drill against your own metrics.

Code — RED metrics (Prometheus client)

package obs

import "github.com/prometheus/client_golang/prometheus"

var (
	// Rate + Errors: one counter, labeled by status class.
	Requests = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "gw_requests_total",
		Help: "requests by route/cluster/status class",
	}, []string{"route", "cluster", "method", "status_class"})

	// Retries: the gw-06 amplification signal (alert on retries/requests).
	Retries = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "gw_retries_total",
		Help: "retry attempts by route/cluster",
	}, []string{"route", "cluster"})

	// Duration: HISTOGRAM (so percentiles aggregate correctly across
	// instances). Buckets chosen around the SLO.
	Duration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "gw_request_duration_seconds",
		Help:    "request duration in seconds by route/cluster",
		Buckets: []float64{.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5},
	}, []string{"route", "cluster"})

	// Saturation gauge (USE): connection-pool health. Event-loop lag
	// would be a second gauge registered the same way.
	PoolInUse = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "gw_pool_inuse_connections",
		Help: "in-use upstream connections by cluster",
	}, []string{"cluster"})
)

func init() {
	prometheus.MustRegister(Requests, Retries, Duration, PoolInUse)
}

Record in the gw-03 outbound filter (where the request is done):

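// Apply runs once the response is known. The file needs "fmt" and "time"
// imported alongside the gw and obs packages.
func (RedMetrics) Apply(c *gw.RequestContext) {
	statusClass := fmt.Sprintf("%dxx", c.Resp.Status/100)
	// NOTE: RouteName is passed for both the "route" and the "cluster"
	// label, so the two are identical here. Pass the upstream cluster name
	// instead if your RequestContext carries one.
	obs.Requests.WithLabelValues(c.RouteName, c.RouteName, c.Req.Method, statusClass).Inc()
	obs.Duration.WithLabelValues(c.RouteName, c.RouteName).
		Observe(time.Since(c.StartedAt()).Seconds())
}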

Cardinality discipline: labels are bounded (route, cluster, method, status_class). Never label by user-id, full path, or trace-id — those go in traces/logs, linked by exemplar.

Querying the right percentile (PromQL)

# p99 latency per route, aggregated across ALL instances correctly:
histogram_quantile(0.99,
  sum by (le, route) (rate(gw_request_duration_seconds_bucket[5m])))

# error ratio per route:
sum by (route) (rate(gw_requests_total{status_class="5xx"}[5m]))
  / sum by (route) (rate(gw_requests_total[5m]))

# the retry-storm early warning (gw-06):
sum(rate(gw_retries_total[1m])) / sum(rate(gw_requests_total[1m]))

Note you sum the buckets first, then take the quantile — you cannot average per-instance p99s.

Code — trace context across the proxy (OpenTelemetry)

package obs

import (
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/trace"
	// plus the gw-03 gateway package that defines gw.RequestContext
)

var propagator = propagation.TraceContext{} // W3C traceparent

// otel.Tracer only records spans once an SDK TracerProvider has been
// registered with otel.SetTracerProvider. With the default no-op provider
// no new span is created and the incoming context is forwarded unchanged.
var tracer = otel.Tracer("gateway")

// TraceProxy wraps the endpoint: extract incoming context, start a span
// for the proxy hop, inject context into the OUTBOUND request so the
// origin's spans join this trace.
func TraceProxy(c *gw.RequestContext, forward func(*http.Request) (*http.Response, error)) (*http.Response, error) {
	ctx := propagator.Extract(c.Req.Context(),
		propagation.HeaderCarrier(c.Req.Header)) // take the baton

	ctx, span := tracer.Start(ctx, "gateway.proxy",
		trace.WithSpanKind(trace.SpanKindServer))
	defer span.End() // this hop's span = gateway-added latency, visible separately

	out := c.Req.Clone(ctx)
	propagator.Inject(ctx, propagation.HeaderCarrier(out.Header)) // pass the baton on

	resp, err := forward(out)
	if err != nil {
		span.RecordError(err)
		// RecordError only adds an event; mark the span failed explicitly.
		span.SetStatus(codes.Error, err.Error())
	} else {
		span.SetAttributes(attribute.Int("http.status_code", resp.StatusCode))
	}
	// Exemplar: put the trace id in the access log so logs<->traces link.
	c.Attributes["trace_id"] = span.SpanContext().TraceID().String()
	return resp, err
}

Tasks

Add the RED metrics + /metrics endpoint to the gw-03 gateway; generate load and confirm rate/errors/duration appear per route.
Wire OTel propagation; run gateway → origin both instrumented, send a request with a traceparent, and confirm in your trace backend (Jaeger/Tempo) that one trace spans gateway + origin (not two disconnected traces).
Break propagation on purpose (don't inject); show the origin starts a new trace and the chain is severed — the high-impact gateway bug.
Run the "p99 doubled" drill: inject latency on one cluster; using only your PromQL, scope it to the route/cluster, confirm with the histogram, and pivot to a slow trace via the exemplar.

Acceptance

RED metrics with a correctly-aggregating p99 (summed buckets, then quantile) and a working retries/requests ratio.
A single stitched trace across gateway + origin; a demonstrated broken trace when propagation is dropped.
You can drive the "p99 doubled" investigation end-to-end on your own signals.

Discussion prompts

Why a histogram (not a summary/gauge) for latency, and why can't you average p99 across instances?
What exactly breaks downstream if the gateway forgets to inject traceparent? Why is the proxy uniquely responsible here?
Which of these signals would you put on a pager vs a dashboard, and why? (Symptom/SLO-burn on the pager; cause metrics on the dashboard.)

gw-12 — Capstone: Leading a Large-Scale Gateway Migration

The JD calls out "Evidence of leading large-scale migrations is a plus," plus driving alignment with stakeholders, setting partner expectations, technical mentorship, and high-quality design/code reviews. This capstone is where the technical phases (gw-01…gw-11) become a leadership problem. Anyone can write a better gateway; a "5" gets the whole fleet there — across billions of requests, dozens of stakeholder teams, and a zero-downtime requirement — without anyone noticing.

Netflix's history is a catalog of exactly these migrations: Zuul 1 → Zuul 2 (blocking → async, gw-03), the connection-churn rollout (gw-04), Pushy's re-architecture and density jump (gw-05), Titus → Kubernetes ("Managing Netflix's Compute"), and container runtime customization with NRI/OCI hooks to run specialized workloads on standard Kubernetes. This lab distills the playbook common to all of them and turns it into something you can speak to with authority.

There is no new protocol here — the capstone is the method: how to take a risky change to a critical, high-traffic system and ship it safely, measurably, and with the org aligned behind you.

1. What is it?

A large-scale migration is moving a critical system from state A to state B while it keeps serving production traffic, where the blast radius is "everything" and the rollback must be instant. The canonical gateway migrations:

Migration	From → To	The risk
Zuul 1 → 2	blocking thread-per-request → Netty async (gw-03)	a rewrite of the request path on every edge cluster
Connection churn	per-request connects → pooled+subsetted (gw-04)	changing how the whole fleet talks to every origin
Pushy evolution	smaller nodes → 200k–400k conns/node (gw-05)	re-architecting a stateful, hundreds-of-millions-of-connection fleet
Ingress → Gateway API	annotations → typed CRDs (gw-10)	re-expressing every route, across many app teams
Titus / VMs → Kubernetes	bespoke orchestration → K8s (gw-09)	moving the substrate the whole fleet runs on
Static config → xDS control plane	redeploys → dynamic push (gw-08)	introducing a new fleet-wide control loop

The method that makes any of them survivable rests on four pillars: (1) make it reversible, (2) validate with production traffic before you commit (shadow), (3) ramp gradually with automated guardrails (canary + SLO-gated rollout), and (4) keep humans aligned the whole way (design reviews, stakeholder comms, mentorship).

2. Why does it matter?

It's the differentiator for a "5". The technical chops in gw-01…gw-11 get you to senior. Leading a migration — scoping it, aligning a dozen teams, shipping it with zero customer impact, and mentoring others through it — is what the role is actually hiring the 5 for. "Evidence of leading large-scale migrations" is the tiebreaker.
The gateway's blast radius makes migrations uniquely scary. A bug in one service hurts that service; a bug in the edge hurts everything (gw-03). So the edge team has the most refined migration discipline in the company, and you'll be expected to wield it from early on.
It's mostly a people problem. The hardest part isn't the code — it's getting partner teams to adopt the new path, setting realistic expectations, running the design review that surfaces the objection you didn't think of, and reviewing others' code to the fleet-wide bar. The JD lists all of these as core responsibilities for a reason.
It's how the talks happen. Every talk in the JD is the story of a migration. Being able to narrate one — what you tried, what broke, what you measured, how you de-risked — in the exact shape of those talks is how you impress the panel and, later, become the person who gives the talk.

3. How does it work?

The migration ladder (lowest risk to highest commitment)

1. DARK / SHADOW     mirror prod traffic to the new path; DON'T serve its
                     responses to users; diff new-vs-old offline.
2. CANARY            serve a tiny % of real traffic on the new path;
                     compare RED metrics + SLOs vs the control group.
3. STICKY CANARY     keep the same users on the canary so you measure a
                     consistent cohort, not request-level noise.
4. RAMP              increase the % in stages (1% → 5% → 25% → 50% → 100%)
                     with automated rollback on SLO/NACK breach (gw-08/11).
5. CUTOVER + SOAK    100% on the new path; keep the old path warm for a
                     soak period so rollback stays instant.
6. DECOMMISSION      only after the soak proves stability: remove the old
                     path, delete the flags, update the runbooks.

You never skip straight to a flag flip. Each rung buys evidence that the next is safe.

Traffic shadowing / mirroring (the de-risking superpower)

The gateway is the perfect place to mirror: copy each request to the new path in addition to the real one, discard (or async-diff) the mirror's response, and serve the user from the old path. You get production-shaped load and inputs against the new system with zero user risk. Envoy supports request mirroring natively (gw-08); your gw-03 gateway can fork the request in the endpoint filter. The diffing (does new produce the same/acceptable output as old?) is where real bugs surface before any user is exposed.

Canary analysis (automated, not eyeballed)

canary group:  N% of traffic (or N% of nodes) on the new path
control group: the rest on the old path
compare: RED metrics, SLO burn, resource USE, business KPIs
         (a statistical comparison, e.g. Netflix's Kayenta/ACA)
decision: auto-promote if canary ≈ control; auto-rollback if worse

The key word is automated: a human watching a dashboard can't gate a fleet-wide ramp safely. SLO-gated, statistically-compared, auto-rollback canaries are the standard.

Reversibility and blast-radius control

Flag everything. The new path is behind a runtime flag (RTDS/ config push, gw-08), so disabling it is a config change, not a deploy.
Keep the old path warm. Don't decommission until the soak proves it; rollback must be instant, which means the old path must still be there.
Stage by blast radius. Roll out region by region, cluster by cluster, node group by node group — never globally at once (the thundering-herd / fleet-wide-blast lesson from gw-04/gw-08).
Cap the damage. Use the connection-churn and reconnect-storm lessons (gw-04/gw-05): staggered rollout, pre-warming, jittered restarts, PodDisruptionBudgets (gw-09) so you never drain too much at once.
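The flag and cohort ideas above can be sketched in a few lines of Go. This is a minimal illustration, not the lab's code: the `Flag` type, its `Set` and `UseNewPath` methods, and the choice of FNV hashing are all assumptions made for the example. In a real fleet the value would arrive through a config push (gw-08) rather than a local call.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync/atomic"
)

// Flag is a hypothetical runtime flag: the percent of traffic sent to the new path.
// Here it is a plain atomic; a control plane would normally update it.
type Flag struct{ percent atomic.Int32 }

// Set changes the rollout percent without a deploy. Set(0) is the rollback.
func (f *Flag) Set(p int32) { f.percent.Store(p) }

// UseNewPath buckets a stable key (user or device ID) into 0..99, so the same
// caller stays in the same cohort as the percent grows (a sticky canary).
func (f *Flag) UseNewPath(key string) bool {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int32(h.Sum32()%100) < f.percent.Load()
}

func main() {
	var f Flag
	f.Set(25)
	onNew := 0
	for i := 0; i < 1000; i++ {
		if f.UseNewPath(fmt.Sprintf("user-%d", i)) {
			onNew++
		}
	}
	fmt.Printf("at 25%%: %d of 1000 keys on the new path\n", onNew)

	f.Set(0) // instant rollback: a config change, not a deploy
	fmt.Println("after rollback:", f.UseNewPath("user-1")) // after rollback: false
}
```

Because the bucket for a key never changes, raising the percent only adds callers to the new path; nobody flips back and forth between paths mid-ramp.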

The org side (where migrations actually fail)

Design review as the author. Write the doc: goals, non-goals, alternatives considered, rollout plan, rollback plan, the risks you're not mitigating and why. The review surfaces the objection you missed; running it well is a core JD duty.
Stakeholder alignment & partner expectations. Partner teams must adopt the new path (e.g. re-declaring routes for Ingress→Gateway-API). You set expectations on timeline, required work, and the deprecation date — and you make the new path easier than the old so adoption is pulled, not pushed.
Mentorship. A migration is how juniors level up: pair them on a slice, review their code to the fleet bar, and let them own a rung of the ladder. The JD lists "technical mentorship" and "high-quality code reviews" as responsibilities, not nice-to-haves.

NRI / OCI hooks as a migration enabler (the Netflix angle)

Migrating workloads onto Kubernetes (gw-09) without forking it is itself a migration problem. Netflix uses containerd NRI plugins and OCI hooks to inject specialized networking, storage, and sidecar behavior per workload at the runtime layer — so they get bespoke behavior and stay on standard Kubernetes. The lesson generalizes: find the extension point that lets you migrate incrementally without forking the platform, instead of a big-bang rewrite.

4. Core terminology

Term	Definition
Migration ladder	Dark → canary → sticky canary → ramp → cutover → decommission.
Shadow / mirror traffic	Copy prod traffic to the new path without serving its responses; zero user risk.
Dark launch	Run the new code in prod without exposing its effects.
Canary	A small slice of real traffic on the new path, compared to a control group.
Sticky canary	Keep the same cohort on the canary for consistent measurement.
Automated canary analysis (ACA)	Statistical compare + auto promote/rollback (e.g. Kayenta).
Ramp	Staged increase of new-path traffic with guardrails.
Soak	A stability period at full traffic before decommissioning the old path.
Reversibility	The property that you can roll back instantly (flag, warm old path).
Blast radius	How much breaks if the change is wrong; staged rollout shrinks per-step radius.
PodDisruptionBudget	K8s cap on simultaneous voluntary disruptions (don't drain too much).
NRI / OCI hooks	Runtime extension points to migrate workloads onto K8s without forking it.
Design doc / review	The written plan + the meeting that pressure-tests it.

5. Mental models

A migration is replacing a plane's engine in flight. You can't land (take downtime). So you add the new engine alongside, test it without relying on it (shadow), switch one of four to it (canary), then the rest gradually — keeping the old one spinning until you're sure (soak). Nobody on board should feel a thing.
Shadow traffic is a flight simulator fed by real weather. You fly the new system through actual production conditions, but a crash hurts no one because the real passengers are on the other plane. It's the cheapest, safest evidence you can buy.
The flag is the seatbelt; the warm old path is the parachute. The flag lets you stop instantly; keeping the old path alive means stopping actually lands you somewhere safe. A rollback plan that requires a deploy isn't a rollback — it's a hope.
Adoption is pulled, not pushed. Teams migrate when the new path is better for them (simpler, faster, safer), not because you sent a deadline email. The best migration leaders make the new path the path of least resistance, then the deadline is a formality.
The blast radius is a dial, not a switch. Region-by-region, cluster-by-cluster, 1%→100% — every rung turns the dial up a little and gives you a chance to turn it back. Big-bang is the switch that has no "back."

6. Common misconceptions

"If it passes staging/tests, just flip it." Staging never matches prod's scale, traffic mix, or weird inputs. Shadow + canary against real traffic is the only trustworthy validation for an edge change.
"Rollback = redeploy the old version." Too slow when the edge is bleeding. Rollback must be a flag/config flip with the old path already warm — seconds, not a deploy cycle.
"Faster is safer (rip the band-aid)." For a critical fleet, slow and staged is safer: each rung limits blast radius and gives the guardrails time to catch a regression before it's global.
"The migration is done at 100%." It's done after the soak and decommission: old path removed, flags deleted, runbooks/dashboards updated, partners off the deprecated path. Half-finished migrations (two code paths forever) are a tax everyone keeps paying.
"It's a technical project." The technical part is often the easy part. Stakeholder alignment, partner adoption, and the design review are where migrations stall or fail. Treat the org work as first-class.
"Canary = watch a dashboard at 1%." Manual eyeballing doesn't scale and isn't statistically sound. Use automated canary analysis with SLO gates and auto-rollback.

7. Interview talking points

"Tell me about a large-scale migration you led." The behavioral centerpiece. Structure: the goal + why it mattered → the constraints (zero downtime, N stakeholders) → the rollout ladder you chose (shadow → canary → ramp → soak) → the guardrails (SLO-gated auto-rollback) → the surprise that broke → how data drove the call → the outcome (measured) → what you'd do differently. Mirror the shape of the Netflix talks. Use gw-04/gw-05 as concrete models if you lack a story of your own scale yet.
"How would you roll out a risky new gateway behavior to the whole fleet?" The ladder: flag it → shadow/mirror and diff → sticky canary with automated analysis → ramp by region/cluster with SLO+NACK-gated auto-rollback → soak → decommission. Name the blast-radius staging and the warm-old-path rollback.
"How do you validate a change without risking users?" Traffic mirroring at the gateway: real prod inputs hit the new path, responses discarded/diffed, users served from the old path. Zero user risk, production-fidelity evidence. (Envoy mirrors natively; gw-03 forks in the endpoint filter.)
"How do you get other teams to migrate?" Make the new path strictly better (easier/faster/safer) so adoption is pulled; set clear expectations and a deprecation date; provide tooling/automation for the migration; and report progress transparently. Push only as a last resort.
"A canary looks slightly worse — promote or roll back?" Default to the SLO/error-budget math (gw-11): if the canary degrades the SLI beyond noise, roll back and investigate; don't "push through" on a hunch. Decisions are data-driven, which is exactly what the JD asks.
"How do you move workloads onto Kubernetes without a big-bang rewrite?" Find the extension point (NRI/OCI hooks, gw-09) that lets you migrate incrementally while staying on standard K8s; run old and new substrates side by side; migrate cluster by cluster. The principle: prefer an incremental path through an extension point over a fork.
"What does 'done' mean for a migration?" Soaked at 100%, old path decommissioned, flags removed, runbooks/dashboards/alerts updated, partners fully off the deprecated path. Not "it's at 100%."

8. Connections to other labs

Everything. This capstone is how you ship gw-01…gw-11 to a real fleet. Each prior lab supplies a piece:
- gw-03 — the data plane you're migrating (Zuul 1→2 is the archetype).
- gw-04 / gw-05 — staged rollout, pre-warming, jitter, and drain are the blast-radius controls; both are real Netflix migrations.
- gw-06 — resilience guardrails keep a bad canary from cascading.
- gw-08 / gw-10 — flags/config and CRD versioning are the migration machinery (dynamic push, conversion webhooks); xDS canaries by node.
- gw-09 — PodDisruptionBudgets, drain ordering, and NRI/OCI hooks make the substrate migration safe.
- gw-11 — SLOs, error budgets, and canary analysis are what gate every rung of the ladder.
db-16…20 (consensus) — staged, ordered, reversible state change with quorum-style safety is the same mindset as committing a replicated log; a migration is "change the cluster's state without ever being in an unsafe intermediate state."

gw-12 — The Hitchhiker's Guide to Leading a Large-Scale Gateway Migration

Companion to CONCEPTS.md, with the runnable rollout engine in src/go/rollout/. This is the capstone: it's how you ship gw-01…gw-11 to a real fleet without anyone noticing. "Evidence of leading large-scale migrations" is the JD's differentiator for a Distributed Systems Engineer 5.

There's no new protocol here — the capstone is the method: taking a risky change to a critical, high-traffic system and shipping it safely, measurably, with instant rollback. Every Netflix talk in the JD is the story of one of these (Zuul 1→2, the churn rollout, Pushy's re-architecture, Titus→Kubernetes). Run bash scripts/verify.sh:

=== healthy rollout ===
  shadow  mirrored traffic; no user impact
  canary    1% ... -> PROMOTE
  ramp      5% ... -> PROMOTE
  ramp     25% ... -> PROMOTE
  ...
  => reached 100%, rolledBack=false
=== rollout that breaches SLO at 25% ===
  ramp     25%  canErr=0.080 -> ROLLBACK
          auto-rollback to 5% (old path still warm)
  => reached 5%, rolledBack=true
=== shadow validation (new path differs on 1% of requests) ===
  mirrored=10000 diffs=100 diffRate=1.00%  (caught before any user was exposed)

1. The migration ladder (rollout.go)

The four pillars of any survivable migration are reversibility, validate-before-commit, gradual-ramp-with-guardrails, and stakeholder alignment. The first three are mechanized in Run, which walks the ladder lowest-risk to highest-commitment:

shadow (0%) → canary (1%) → ramp (5% → 25% → 50%) → full (100%)

TestLadderRampsToFullWhenHealthy: when the canary matches the baseline at every stage, it ramps all the way to 100%. TestLadderRollsBackOnSLOBreach: when the new path breaches the SLO at 25% (a bug only visible under real load — the most dangerous kind), it auto-rolls-back to the last good stage (5%) and stops. You never skip rungs; each one buys evidence that the next is safe.

The crucial property is in the rollback line: "old path still warm." Reached tracks the last safely-promoted percent, and rollback returns there instantly because the old path was never decommissioned. A rollback that requires a redeploy isn't a rollback — it's a hope. Reversibility is the whole game.
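A minimal sketch of the ladder walk, under stated assumptions: `runLadder`, `stages`, and the `healthyAt` gate are hypothetical names for illustration and are not taken from rollout.go. The gate stands in for the canary analysis at each rung.

```go
package main

import "fmt"

// stages is the ladder from the text: shadow, canary, ramp, full.
var stages = []int{0, 1, 5, 25, 50, 100}

// runLadder walks the stages in order. healthyAt is a hypothetical gate that
// reports whether the new path is within tolerance at a given percent.
// It returns the last safely promoted percent and whether a rollback happened.
func runLadder(healthyAt func(percent int) bool) (reached int, rolledBack bool) {
	for _, p := range stages {
		if !healthyAt(p) {
			// Stay at the last good stage. The old path was never removed,
			// so returning there needs no deploy.
			return reached, true
		}
		reached = p
	}
	return reached, false
}

func main() {
	r, rb := runLadder(func(int) bool { return true })
	fmt.Printf("healthy: reached %d%%, rolledBack=%v\n", r, rb)
	// healthy: reached 100%, rolledBack=false

	r, rb = runLadder(func(p int) bool { return p < 25 })
	fmt.Printf("breach at 25%%: reached %d%%, rolledBack=%v\n", r, rb)
	// breach at 25%: reached 5%, rolledBack=true
}
```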

2. Automated canary analysis (rollout.go)

Analyze is the SLO gate: it compares the canary's golden signals (gw-11) to the baseline and decides PROMOTE or ROLLBACK. TestAnalyzePromoteWhenHealthy, ...RollbackOnErrors, and ...RollbackOnLatency pin the three outcomes: within tolerance → promote; error rate exceeds the delta → rollback; tail latency exceeds the ratio → rollback.

The key word is automated. A human watching a dashboard at 1% can't gate a fleet-wide ramp safely — they're too slow and not statistically rigorous. Real systems use Kayenta-style statistical comparison (the lab uses simple thresholds to make the mechanics legible). The decision is data-driven, exactly the JD's "identify root causes using data" — you don't "push through" a degraded canary on a hunch.
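The threshold gate described above might look like the following sketch. The `Metrics` fields, the `analyze` name, and both tolerance constants are assumptions chosen for the example, not the lab's actual values or signatures.

```go
package main

import "fmt"

// Metrics holds the two golden signals this sketch compares.
type Metrics struct {
	ErrorRate float64 // fraction of failed requests, 0..1
	P99Millis float64 // tail latency in milliseconds
}

// Tolerances are illustrative assumptions, not the lab's defaults.
const (
	maxErrorDelta   = 0.01 // canary may exceed the baseline error rate by 1 point
	maxLatencyRatio = 1.20 // canary p99 may be up to 20% above the baseline
)

type Decision string

const (
	Promote  Decision = "PROMOTE"
	Rollback Decision = "ROLLBACK"
)

// analyze compares canary to baseline and returns the gate decision.
func analyze(baseline, canary Metrics) Decision {
	if canary.ErrorRate-baseline.ErrorRate > maxErrorDelta {
		return Rollback
	}
	if baseline.P99Millis > 0 && canary.P99Millis/baseline.P99Millis > maxLatencyRatio {
		return Rollback
	}
	return Promote
}

func main() {
	base := Metrics{ErrorRate: 0.001, P99Millis: 100}
	fmt.Println(analyze(base, Metrics{ErrorRate: 0.002, P99Millis: 105})) // PROMOTE
	fmt.Println(analyze(base, Metrics{ErrorRate: 0.080, P99Millis: 100})) // ROLLBACK
	fmt.Println(analyze(base, Metrics{ErrorRate: 0.001, P99Millis: 150})) // ROLLBACK
}
```

A fixed threshold like this is easy to read but sensitive to noise at low traffic, which is why production systems compare distributions instead.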

3. Shadow traffic: zero-risk validation (shadow.go)

Before any user touches the new path, you mirror production traffic to it, discard (or diff) the response, and serve the user from the old path. Shadow runs n requests through both and counts diffs. TestShadowDiff catches a new path that differs on 1% of requests; the important test is TestShadowZeroRiskWhenNewPathBroken: even when the new path always returns garbage, all 50 users are still served by the old path — the shadow only compares, never returns. That's the property that makes shadowing the cheapest, safest pre-canary validation: you fly the new system through real production weather, but a crash hurts no one.

The gateway is the perfect place to mirror (Envoy does it natively, gw-08; your gw-03 endpoint can fork the request). The diffing — does the new path produce acceptable output? — is where real bugs surface before exposure.
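The zero-risk property is easy to see in a sketch. The `Path` type and `shadow` function below are hypothetical stand-ins, not the signatures in shadow.go, and the loop is synchronous for clarity; a real gateway would mirror asynchronously and guard against a panic in the new path.

```go
package main

import "fmt"

// Path is a hypothetical stand-in for a request handler: request in, response out.
type Path func(req int) string

// shadow serves every request from oldPath and runs newPath only to compare.
// It returns what users were served plus the number of mismatches.
func shadow(n int, oldPath, newPath Path) (served []string, diffs int) {
	for req := 0; req < n; req++ {
		want := oldPath(req)
		if got := newPath(req); got != want {
			diffs++ // recorded for analysis, never returned to the user
		}
		served = append(served, want)
	}
	return served, diffs
}

func main() {
	oldPath := func(req int) string { return fmt.Sprintf("ok-%d", req) }
	// The new path disagrees on every 100th request.
	newPath := func(req int) string {
		if req%100 == 0 {
			return "different"
		}
		return fmt.Sprintf("ok-%d", req)
	}
	served, diffs := shadow(10000, oldPath, newPath)
	fmt.Printf("mirrored=%d diffs=%d diffRate=%.2f%%\n",
		len(served), diffs, 100*float64(diffs)/float64(len(served)))
	// mirrored=10000 diffs=100 diffRate=1.00%
}
```

Note that `served` is built only from `oldPath` results, so no behavior of the new path can reach a user.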

4. The org side (where migrations actually fail)

The code mechanizes the technical ladder; the hard part is people, and it's what the JD lists as core duties ("driving alignment with stakeholders," "setting partner expectations," "technical mentorship," "high-quality design/code reviews"). The CONCEPTS file and docs/analysis.md give the playbook:

Write the design doc (goals, non-goals, alternatives, rollout + rollback plan, risks accepted) and run the review as the author.
Make the new path strictly better so adoption is pulled, not pushed; set a clear deprecation date; provide migration tooling.
"Done" means soaked + decommissioned, flags removed, runbooks updated, partners off the old path — not "it's at 100%."

The capstone exercise in docs/analysis.md asks you to take one gw-* change (e.g. gw-01→gw-04 pooling) through this exact ladder with a written design doc — the artifact a "5" is judged on.

5. NRI/OCI hooks: migrating onto Kubernetes without forking it

The Netflix "Container Runtime Customization" talk is a migration lesson: they moved workloads onto standard Kubernetes by using containerd NRI plugins and OCI hooks to inject per-workload networking/storage/sidecar behavior at the runtime layer — bespoke behavior without forking Kubernetes. The general principle for any platform migration: find the extension point that lets you migrate incrementally, instead of a big-bang rewrite or a fork. (gw-09 covers the K8s networking substrate this rides on.)

6. Hands-on

cd src/go
bash ../scripts/verify.sh        # tests + the healthy/breaching/shadow demos
go run ./cmd/rolloutsim

Then do the capstone (docs/analysis.md): drive a real gw-* change through shadow → canary → ramp using the gw-11 metrics as the gate, with a written design doc and a rehearsed rollback.

7. Exercises

Statistical canary: replace threshold Analyze with a Mann-Whitney/CI-based comparison over samples (Kayenta-style); show it tolerates noise but catches a real regression.
Sticky canary: keep the same cohort on the canary across stages (consistent measurement) instead of re-sampling per request.
Blast-radius staging: extend the ladder to roll out region-by-region / cell-by-cell with a PodDisruptionBudget-style cap (gw-09) so you never drain too much at once.
Wire to the phase: gate the ladder on real gw-11 RED metrics from a gw-03 gateway running a gw-04 change; auto-rollback via a gw-08 config flip.
Game day: rehearse the rollback before the ramp — trigger a synthetic breach and confirm the auto-rollback returns to the warm old path with zero errors.

gw-12 — References

Netflix migrations (the talks, as migration case studies)

Evolution of Edge @ Netflix / Zuul 2: The Netflix Journey to Async — the Zuul 1→2 migration (gw-03). https://netflixtechblog.com/zuul-2-the-netflix-journey-to-asynchronous-non-blocking-systems-45947377fb5c
Curbing Connection Churn in Zuul — a fleet-wide data-plane behavior migration (gw-04). https://netflixtechblog.com/curbing-connection-churn-in-zuul-2feb273a3598
Pushy to the Limit — re-architecting a stateful fleet + density migration (gw-05). https://netflixtechblog.com/pushy-to-the-limit-evolving-netflixs-websocket-proxy-for-the-future-b468bc0ff658
Managing Netflix's Compute with Kubernetes / Titus — the substrate migration (gw-09). https://queue.acm.org/detail.cfm?id=3158370
Container Runtime Customization at Netflix (NRI & OCI Hooks) — migrating workloads onto K8s without forking it. https://github.com/containerd/nri

Rollout tooling & technique

Netflix Kayenta / Automated Canary Analysis (Spinnaker) — the statistical canary-compare that gates ramps. https://github.com/spinnaker/kayenta
Netflix TechBlog — Automated Canary Analysis at Netflix with Kayenta and the sticky-canary posts.
Argo Rollouts / Flagger — open-source progressive delivery (canary, blue-green, analysis, auto-rollback) for Kubernetes.
Envoy request mirroring (request_mirror_policies) — shadow traffic at the data plane (gw-08). https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/http/traffic_shadowing

Reliability & org practice

Google SRE Book / Workbook — "Release Engineering," canarying, error budgets governing release velocity, change management. https://sre.google/workbook/canarying-releases/
Accelerate (Forsgren/Humble/Kim) — why small, frequent, reversible changes outperform big-bang.
Will Larson, An Elegant Puzzle / Staff Engineer — leading cross-team technical work, design reviews, driving alignment.
Netflix culture deck — "context not control," "highly aligned, loosely coupled," freedom & responsibility (the stakeholder-alignment frame).

Cross-lab dependencies

Upstream: all of gw-01…gw-11 (the things you migrate and the guardrails that make it safe).
This is the capstone; there is no downstream — it's how the phase ships to production.

gw-12 — Analysis & Capstone Exercise

This capstone is a method, not a new protocol. The "analysis" is a written migration plan you produce, plus a runnable harness that exercises the ladder against the labs you built.

The capstone exercise

Pick one migration and take it through the full ladder using your own gw-* implementations:

Migrate origin calls from "connect-per-request" (gw-01) to "pooled + subsetted" (gw-04) across a simulated fleet — with zero dropped requests and a measured churn reduction — using shadow → canary → ramp → soak, gated by the gw-11 metrics, driven by the gw-08 control plane, with instant rollback.

Deliverables:

A design doc (the artifact a "5" is judged on). Sections:
- Goal / non-goals — "cut origin connection churn ≥5×; do not change request semantics."
- Current state / target state — gw-01 path → gw-04 path.
- Alternatives considered — h2-only origins? bigger pools? — and why subsetting+pools.
- Rollout plan — the ladder, with per-rung success criteria.
- Rollback plan — flag flip; old Transport kept warm.
- Blast-radius staging — by simulated region/cluster/node group.
- Risks accepted — e.g. per-loop pools raise idle connections; subset size vs resilience headroom.
- Observability — exactly which gw-11 metrics gate each rung.
- Stakeholders & comms — who's affected, what they must do, the timeline and deprecation date.
A runnable rollout harness that:
- routes a configurable % of traffic to the new path (gw-03 routing + gw-08 flag push);
- shadows 100% to the new path while serving from the old, and diffs responses (status/latency) — zero user risk;
- runs an automated canary check: compare new-vs-old RED metrics (gw-11); auto-rollback if the canary's error rate or p99 regresses beyond a threshold;
- ramps 1% → 5% → 25% → 100% with a soak at each rung;
- proves the churn drop (gw-04 connections.created.rate) with no increase in 5xx through the whole ramp.

What "done" looks like

A before/after plot: churn collapses (gw-04), p99 and error rate unchanged across the ramp (gw-11).
A demonstrated instant rollback: flip the flag mid-ramp, traffic returns to the old path with no errors.
A design doc that someone could actually run a review against.

Leadership rubric (how a "5" is assessed on this)

Dimension	Evidence in your capstone
Ownership	you scoped, built, rolled out, and verified end-to-end
Judgment	reversible-by-default, staged blast radius, data-gated decisions
Risk management	shadow before serve; warm rollback; soak before decommission
Communication	a design doc that names non-goals, alternatives, and accepted risks
Stakeholder alignment	a comms plan + deprecation timeline, adoption-by-pull
Mentorship	a slice scoped for a junior to own, with a review plan
Data discipline	every promote/rollback decision tied to a gw-11 metric

Tradeoffs worth flagging (the senior reflexes)

Speed vs safety. Healthy error budget (gw-11) buys faster ramps; a burning budget means slow down or freeze. The budget is the dial.
Two code paths are a tax. Every rung you don't finish leaves dual paths to maintain; plan the decommission, don't let it rot.
Shadow has limits. It validates reads/idempotent paths cleanly; for writes you need careful diffing or a separate strategy (the mirrored write must not cause real side effects).
Stakeholders are the critical path. The technical rollout is often ready before the partner teams are; the migration leader's real job is unblocking them.
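One conservative way to respect the "shadow has limits" point is to mirror only requests that cannot cause side effects. The sketch below is an assumption about how such a filter could look; the `shouldMirror` name is hypothetical, and the method list is a deliberately cautious subset of HTTP's safe methods.

```go
package main

import (
	"fmt"
	"net/http"
)

// shouldMirror reports whether a request can be replayed against the new path
// without risking duplicate side effects. Only read-only methods are mirrored.
func shouldMirror(method string) bool {
	switch method {
	case http.MethodGet, http.MethodHead, http.MethodOptions:
		return true
	default:
		// POST, PUT, PATCH, and DELETE would need a sandboxed backend,
		// a dry-run mode, or careful diffing before they can be mirrored.
		return false
	}
}

func main() {
	for _, m := range []string{http.MethodGet, http.MethodPost, http.MethodDelete} {
		fmt.Printf("%s mirror=%v\n", m, shouldMirror(m))
	}
}
```

A filter like this trades coverage for safety: write paths go unvalidated by shadowing and need their own strategy before the canary.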

What production adds beyond this exercise

Statistical automated canary analysis (Kayenta-style) rather than a threshold check.
Region/cell-aware staged rollout with PodDisruptionBudgets and surge control (gw-09) so you never drain too much at once.
Org tooling: migration dashboards, partner self-service, automated deprecation tracking.
A real rollback rehearsal ("game day") before the ramp, not just a rollback plan on paper.

gw-12 — Execution

Prerequisites

Go ≥ 1.25 (stdlib only, offline).

One-shot

cd gw-12-capstone-gateway-migration && bash scripts/verify.sh   # tests + rollout demos

Per-language workflow (Go)

cd gw-12-capstone-gateway-migration/src/go
go test -race -count=1 ./...        # canary analysis, ladder ramp/rollback, shadow
go run ./cmd/rolloutsim            # healthy ramp + SLO-breach rollback + shadow diff

Package map

File	What
`rollout/rollout.go`	metrics, canary analysis (SLO gate), the migration ladder with auto-rollback
`rollout/shadow.go`	zero-risk shadow/mirror validation + diff rate
`cmd/rolloutsim`	healthy rollout, SLO-breach rollback, and shadow validation demos

The capstone

docs/analysis.md defines the capstone exercise: take one gw-* change (e.g. gw-01 connect-per-request → gw-04 pooled+subsetted) through shadow → canary → ramp → soak, gated by gw-11 metrics, driven by gw-08 config, with a written design doc and a rehearsed rollback. See GUIDE.md for the deep dive (incl. the org/stakeholder side and NRI/OCI as a migration enabler).

gw-12 — Verification

One command

cd gw-12-capstone-gateway-migration && bash scripts/verify.sh

What the tests prove

Test	Invariant
`TestAnalyzePromoteWhenHealthy`	a canary within error/latency tolerance is promoted
`TestAnalyzeRollbackOnErrors`	an error-rate breach beyond the delta rolls back
`TestAnalyzeRollbackOnLatency`	a tail-latency breach beyond the ratio rolls back
`TestLadderRampsToFullWhenHealthy`	a healthy rollout ramps to 100%
`TestLadderRollsBackOnSLOBreach`	an SLO breach at 25% auto-rolls-back to the last good stage (5%)
`TestShadowDiff`	shadow detects new-vs-old response diffs (1% here)
`TestShadowZeroRiskWhenNewPathBroken`	even a fully-broken new path never affects served users (shadow only compares)

All under -race.

Demo (rolloutsim, in verify.sh)

a healthy rollout reaching 100%,
a rollout that breaches at 25% and auto-rolls-back to 5% (old path warm),
shadow validation catching diffs before any user is exposed.

What "green" does NOT guarantee

Threshold canary, not statistical. Production uses Kayenta-style analysis (exercise §7.1).
No blast-radius staging by region/cell. Region-by-region rollout + PodDisruptionBudgets is an exercise (§7.3).
Not wired to the live phase. Gating on real gw-11 metrics from a gw-03+gw-04 stack and rolling back via gw-08 is the capstone exercise (docs/analysis.md, GUIDE §7.4).
The org work is on you. The design doc, stakeholder alignment, and rehearsed rollback (the actual "5" differentiators) are described in CONCEPTS / analysis, not code.

Phase 7 — Platform & Distributed Systems Architecture

Target role: Software Architect – Distributed Systems & Platform Engineering, Apple (CAD Infrastructure Development).

Phases 1–5 build the systems behind a request (storage, consensus). Phase 6 builds the systems in front of a request (the cloud gateway). Phase 7 is about the systems between services — how you decompose a platform into services, how those services talk (sync and async), how you partition and keep data consistent, how you ship and operate it all (IaC, GitOps, SLOs), and — because this is an architect role — how you make and land the technical decisions that let other engineers move fast.

This is the "10 years + ownership of architecture" tier. The labs build the runnable substrate (event buses, a partitioned log, the outbox/saga patterns, consistent hashing, an IaC engine, a GitOps reconciler, an SLO engine), but the meta-skill the role is hiring for is judgment: choosing the right pattern, writing it down (ADRs), and building consensus. Every lab carries both.

Why this phase exists (and what it reuses)

Much of the Apple JD is already covered earlier in the book; Phase 7 cross-references those rather than rebuilding them, and builds the genuinely new platform-architecture topics.

The role wants…	Where it lives
Microservices decomposition & service contracts	pa-01 (new)
API design across REST, gRPC, events	pa-02 (new)
Event-driven architecture & async patterns	pa-03 (new)
Message queues / streaming (Kafka, NATS, Pulsar, RabbitMQ)	pa-04 (new)
Delivery semantics, outbox, sagas, fault tolerance	pa-05 (new)
Consistency models, partitioning strategies	pa-06 (new) + db-16
Infrastructure-as-Code (Terraform, Pulumi)	pa-07 (new)
GitOps (ArgoCD, Flux) & progressive delivery	pa-08 (new) + gw-12
SLOs, reliability eng, circuit breakers	pa-09 (new) + gw-06, gw-11
Architecture, design/code review, mentoring, consensus	pa-10 (new)
Kubernetes-native: operators, CRDs	gw-10 (covered)
Service mesh	gw-08 (covered)
Distributed tracing	gw-11 (covered)
Consensus, replication, fault tolerance	db-16…20 (covered)

Every new JD bullet maps to a Phase 7 lab; every overlapping bullet is already built elsewhere and linked.

How this phase is structured

Same proven shape as Phase 6 — each lab pa-NN ships:

pa-NN-name/
├── CONCEPTS.md      # the 8-part framework (the "why")
├── GUIDE.md         # the maintainer-level, hands-on deep dive
├── references.md    # papers, books, the canonical systems to study
├── docs/{analysis,execution,verification}.md
├── steps/           # staged, code-rich implementation guides
├── scripts/verify.sh
└── src/go/          # REAL, compilable, `go test -race`-green Go (stdlib-only, offline)

The runnable code you'll build and hack on:

a service-dependency analyzer (cycle detection, blast radius, layering rules) — pa-01,
a contract-compatibility checker (protobuf-style wire rules) + idempotency store + cursors — pa-02,
an event bus with at-least-once delivery, idempotent consumers, and a dead-letter queue — pa-03,
a partitioned commit log (Kafka in miniature): partitions, offsets, consumer groups, rebalancing — pa-04,
the transactional outbox and a crash-recoverable saga orchestrator — pa-05,
consistent hashing (vnodes, minimal movement) + a quorum simulator — pa-06,
a mini IaC engine: dependency DAG, plan/diff, apply, state, drift detection — pa-07,
a GitOps reconciler: sync, prune, self-heal, sync waves — pa-08,
an SLO + multi-window burn-rate alerting engine + bulkheads — pa-09,
architecture fitness functions (automated architecture tests) — pa-10.

Run the whole phase: bash pa-00-platform-architecture-overview/verify-all.sh.

What "Software Architect" actually means here

This is not "senior engineer + 2." An architect is measured on leverage: decisions that let many teams move faster and safer.

You make cross-cutting decisions and write them down. ADRs / RFCs that capture context, options, trade-offs, and the decision — so the why survives you. (pa-10)
You design for evolution, not perfection. Service boundaries and contracts that can change independently; the "distributed monolith" is the cardinal failure. (pa-01, pa-02)
You enable, you don't gatekeep. Patterns, paved roads, and automated fitness functions so the right thing is the easy thing — not a review bottleneck. (pa-10)
You build consensus. The hardest part is aligning teams with different incentives. Design reviews, prototypes, and data — not authority. (pa-10)
You own the trade-offs. CAP/PACELC, sync vs async, consistency vs availability, build vs buy. You can defend each choice with numbers. (pa-06, pa-03, pa-09)

The Apple CAD-infrastructure context: you're building the internal platform a large hardware/silicon engineering org runs on — job/data services, event pipelines, and the paved roads other teams build on. The JD is generic distributed-systems architecture; the labs keep that generality.

A note on languages

The JD says Go, Java, or Python. Phase 7's runnable code is Go (stdlib-only, so it builds offline) — the platform/infra lingua franca (Kubernetes, Terraform, etcd, NATS, most operators). The patterns are language-agnostic; the GUIDEs note where Java (Spring, Kafka Streams) or Python idioms differ.

Suggested path

pa-01 (decomposition) ─▶ pa-02 (API/contracts) ─▶ pa-03 (event-driven)
                                                        │
                                  ┌──────────────────────┼───────────────────┐
                                  ▼                      ▼                    ▼
                           pa-04 (log)           pa-05 (outbox/saga)   pa-06 (partitioning)
                                  │
        ┌──────────────────────────┼──────────────────────────┐
        ▼                          ▼                           ▼
  pa-07 (IaC)              pa-08 (GitOps)              pa-09 (SLOs/reliability)
                                                              │
                                                  pa-10 (architecture in practice)

Read HITCHHIKERS-GUIDE.md first, then INTERVIEW.md for the architect-level system-design playbook, behavioral mapping, 30-60-90, and questions to ask them.

The Hitchhiker's Guide to Platform Architecture

A warm-up primer for Phase 7. Read this first. It builds the mental models an architect carries, shows how the runnable labs fit together, and gives you the throughline that turns ten labs into one coherent story you can defend in an architecture interview.

Don't panic. By the end you'll have built — in real, tested, runnable Go — an event bus, a partitioned commit log, the transactional outbox, a saga orchestrator, consistent hashing, an IaC engine, a GitOps reconciler, and an SLO/burn-rate engine. But the deeper goal is the judgment the architect role is hiring for, so each lab pairs the code with the trade-off behind it.

1. The four questions every platform architecture answers

Every distributed platform — Apple's CAD infra or anyone's — is a set of answers to four questions. Phase 7 is organized around them.

1. WHERE ARE THE BOUNDARIES?   how do we split this into services?      pa-01, pa-02
2. HOW DO SERVICES TALK?       sync request/reply vs async events       pa-02, pa-03, pa-04, pa-05
3. WHERE DOES THE DATA LIVE?   partitioning + consistency model         pa-06
4. HOW DO WE SHIP & OPERATE?   IaC, GitOps, SLOs, and decision-making   pa-07, pa-08, pa-09, pa-10

If you can answer those four for a system — with the trade-offs named — you can architect it. The labs make each answer concrete and runnable.

2. The one trade-off that dominates: sync vs async

More than any other choice, synchronous vs asynchronous communication shapes a platform's failure behavior, and architects are tested on it constantly.

SYNC (REST/gRPC, pa-02):   A calls B and waits.
  + simple, immediate answer, easy to reason about.
  - TEMPORAL COUPLING: A is only as available as B; latency compounds
    down the chain; one slow dependency cascades (you need breakers,
    bulkheads, timeouts — gw-06, pa-09).

ASYNC (events/log, pa-03/pa-04):  A emits an event; B consumes when ready.
  + decoupled in time and availability; load-leveling; fan-out; audit log.
  - EVENTUAL CONSISTENCY: ordering, idempotency, duplicate delivery, and
    "where did my event go" become YOUR problems (pa-05).

There is no free lunch — you move complexity, you don't remove it. The architect's job is to put each edge on the right side of that line and own the resulting failure model. Labs pa-03/04/05 build the async side so its costs (delivery semantics, the dual-write trap, sagas) are concrete, not abstract.
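To make the async cost concrete, here is a minimal sketch of an idempotent consumer under at-least-once delivery. The `Event` and `Consumer` types are hypothetical, and the in-memory ID set is an assumption for brevity; a real consumer would persist processed IDs atomically with the state change.

```go
package main

import "fmt"

// Event carries a producer-assigned ID so consumers can detect redelivery.
type Event struct {
	ID     string
	Amount int
}

// Consumer applies each event at most once by remembering processed IDs.
type Consumer struct {
	seen  map[string]bool
	Total int
}

func NewConsumer() *Consumer { return &Consumer{seen: make(map[string]bool)} }

// Handle returns true if the event was applied, false if it was a duplicate.
func (c *Consumer) Handle(e Event) bool {
	if c.seen[e.ID] {
		return false
	}
	c.seen[e.ID] = true
	c.Total += e.Amount
	return true
}

func main() {
	c := NewConsumer()
	// At-least-once delivery: e1 arrives twice.
	for _, e := range []Event{{"e1", 10}, {"e2", 5}, {"e1", 10}} {
		fmt.Printf("%s applied=%v\n", e.ID, c.Handle(e))
	}
	fmt.Println("total:", c.Total) // total: 15
}
```

The duplicate is absorbed by the consumer, not prevented by the transport, which is the practical meaning of "duplicate delivery becomes your problem."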

3. The distributed-monolith trap (the cardinal sin)

The most common failed "microservices" architecture is a distributed monolith: services that look independent but must deploy together, share a database, or call each other synchronously on every request. You've taken a monolith and added network latency, partial failure, and serialization overhead: strictly worse.

The antidotes run through the whole phase:

Boundaries by capability (bounded contexts), each independently deployable and owning its data (pa-01).
Contracts that version independently so a producer change doesn't force a lockstep consumer deploy (pa-02).
Async where you can to break temporal coupling (pa-03/04).
The outbox so a service owns its data and its events without a shared DB or a dual write (pa-05).
Fitness functions that fail CI when a cycle or a forbidden dependency sneaks in (pa-10).

pa-01's servicegraph literally detects the structural symptoms (cycles, high coupling) in code.

4. The throughline back to the rest of the book

Phase 7 stands on the consensus and gateway work:

Reconciliation everywhere. The IaC engine (pa-07), the GitOps reconciler (pa-08), the Kubernetes operator (gw-10), and the xDS control plane (gw-08) are the same level-triggered, idempotent, converge-to-desired-state loop. Once you see it four times, it's a reflex: declare desired state, diff against actual, converge safely.
Delivery semantics are a consistency problem. At-least-once delivery plus idempotent consumers (pa-03/05) is the practical face of the same exactly-once-is-a-myth lesson from gw-05's push delivery.
Partitioning + quorums (pa-06) is db-17's majority argument and gw-04's subsetting ring, applied to data placement.
SLOs/error budgets (pa-09) extend gw-11's observability into the reliability-engineering discipline that governs how fast you ship.
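The reconciliation loop named in the first bullet is small enough to sketch. This is a minimal illustration under stated assumptions, not the code from pa-07, pa-08, gw-08, or gw-10: state is modeled as a map from resource name to version, and Action, diff, and reconcile are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// Action is one step needed to move actual state toward desired state.
type Action struct {
	Op   string // "create", "update", or "delete"
	Name string
}

// diff compares desired and actual state and returns the actions that
// would converge actual to desired, sorted by name for determinism.
func diff(desired, actual map[string]string) []Action {
	var acts []Action
	for name, want := range desired {
		got, ok := actual[name]
		switch {
		case !ok:
			acts = append(acts, Action{Op: "create", Name: name})
		case got != want:
			acts = append(acts, Action{Op: "update", Name: name})
		}
	}
	for name := range actual {
		if _, ok := desired[name]; !ok {
			acts = append(acts, Action{Op: "delete", Name: name})
		}
	}
	sort.Slice(acts, func(i, j int) bool { return acts[i].Name < acts[j].Name })
	return acts
}

// reconcile applies the diff to actual in place and returns how many
// actions it took. On already-converged state it returns 0 (idempotent).
func reconcile(desired, actual map[string]string) int {
	acts := diff(desired, actual)
	for _, a := range acts {
		switch a.Op {
		case "create", "update":
			actual[a.Name] = desired[a.Name]
		case "delete":
			delete(actual, a.Name)
		}
	}
	return len(acts)
}

func main() {
	desired := map[string]string{"web": "v2", "api": "v1"}
	actual := map[string]string{"web": "v1", "old-job": "v1"}
	fmt.Println("first pass actions:", reconcile(desired, actual))
	fmt.Println("second pass actions:", reconcile(desired, actual))
}
```

The loop is level-triggered because it looks only at the current desired and actual state, never at the history of events that produced them, so a missed notification costs one extra pass rather than correctness.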

You are not learning a new field. You're learning to compose the pieces into a platform and to own the trade-offs.

5. How Phase 7 is built (and how to use it)

Every lab ships real, stdlib-only Go that is green under go test -race and builds offline, plus a maintainer-level GUIDE.md, CONCEPTS.md, references.md, docs/{analysis,execution,verification}.md, code-rich steps/, and scripts/verify.sh. Work a lab like this:

Read CONCEPTS.md (the why + the 8-part framework).
Open GUIDE.md next to src/go/ and read the code it walks you through.
Run bash scripts/verify.sh and read the test names — they're the spec.
Run the demo CLI to see the headline result.
Do the exercises (they're the interview).

Verify the whole phase:

bash pa-00-platform-architecture-overview/verify-all.sh

6. The headline results you can reproduce

Lab	`verify.sh` shows
pa-01	a dependency cycle detected + a service's blast radius computed
pa-03	an event fan-out, a poison message landing in the DLQ after N retries, idempotent dedup
pa-04	per-partition ordering, a key always hitting the same partition, consumer-group offset resume + rebalance
pa-05	the outbox surviving a crash (at-least-once), a saga compensating on failure and resuming after a crash
pa-06	consistent hashing moving ~1/N keys on a node change vs ~all for mod-N; quorum `R+W>N` overlap
pa-07	a plan diff, a topologically-ordered apply, an idempotent re-apply (no-op), and drift detected
pa-08	GitOps sync, prune, and self-heal correcting manual drift
pa-09	a fast burn-rate alert firing on real budget burn but not on a blip
pa-10	a fitness function failing CI on a forbidden dependency / cycle
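The pa-06 row is the easiest headline to try in isolation. The sketch below is not the lab's implementation: Ring, NewRing, Add, Lookup, and hashOf are hypothetical names, FNV-1a and 100 virtual nodes per node are arbitrary choices, and the exact counts it prints depend on the hash.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hashOf maps a string to a position on the ring.
func hashOf(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// Ring is a consistent-hash ring with virtual nodes.
type Ring struct {
	points []uint32          // sorted virtual-node positions
	owner  map[uint32]string // position -> owning node
}

func NewRing(vnodes int, nodes ...string) *Ring {
	r := &Ring{owner: make(map[uint32]string)}
	for _, n := range nodes {
		r.Add(n, vnodes)
	}
	return r
}

// Add places vnodes virtual nodes for node on the ring.
func (r *Ring) Add(node string, vnodes int) {
	for i := 0; i < vnodes; i++ {
		p := hashOf(fmt.Sprintf("%s#%d", node, i))
		if _, taken := r.owner[p]; taken {
			continue // rare position collision: keep the existing owner
		}
		r.owner[p] = node
		r.points = append(r.points, p)
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
}

// Lookup returns the owner of the first virtual node at or after the key's
// position, wrapping around. The ring must hold at least one node.
func (r *Ring) Lookup(key string) string {
	h := hashOf(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.owner[r.points[i]]
}

func main() {
	ring := NewRing(100, "n0", "n1", "n2", "n3")
	const keys = 10000
	before := make([]string, keys)
	for i := range before {
		before[i] = ring.Lookup(fmt.Sprintf("key-%d", i))
	}
	ring.Add("n4", 100) // grow from 4 to 5 nodes
	ringMoved, modMoved := 0, 0
	for i := range before {
		k := fmt.Sprintf("key-%d", i)
		if ring.Lookup(k) != before[i] {
			ringMoved++
		}
		if hashOf(k)%4 != hashOf(k)%5 {
			modMoved++
		}
	}
	fmt.Printf("consistent hashing moved %d/%d keys; mod-N moved %d/%d\n",
		ringMoved, keys, modMoved, keys)
}
```

One property holds regardless of the hash: adding a node only inserts new positions, so any key that moves can only move to the new node. Mod-N has no such guarantee, which is why it reshuffles most keys.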

7. Suggested path

pa-01 ─▶ pa-02 ─▶ pa-03 ─▶ pa-04 ─▶ pa-05      [the comms + data spine, in order]
                                  │
        ┌──────────────────────────┼──────────────────────────┐
        ▼                          ▼                           ▼
  pa-06 (data)             pa-07 (IaC) ─▶ pa-08 (GitOps)   pa-09 (SLOs)
                                                              │
                                                  pa-10 (architecture in practice)

Now read pa-01's CONCEPTS. The first question every platform answers is "where are the boundaries?" — start there.

Phase 7 — Architect Interview Playbook

A Software Architect loop is not a senior-engineer loop with bigger numbers. The bar is judgment and leverage: can you make a cross-cutting decision, defend its trade-offs with data, write it down, and get many teams to adopt it? The usual rounds are:

System design at scale — "design a platform/service for X." They probe boundaries, contracts, data, failure, and evolution.
A deep architecture trade-off — "sync vs async here? strong vs eventual? build vs buy?" They want the reasoning, not the answer.
Architecture review of an existing/your design — find the risks, the coupling, the failure modes; propose the migration.
Behavioral / leadership — driving consensus, mentoring, owning a decision that was wrong, influencing without authority.
(sometimes) a coding/contract screen — an API/schema/idempotency problem.

1. The platform system-design playbook

Architect design questions reward structure + trade-off ownership. Use this spine.

Step 0 — Clarify the "-ilities" and the scale, out loud

Before drawing anything: "What are we optimizing — latency, throughput, consistency, cost, dev velocity? What's the scale (RPS, data size, teams, event rate)? What's the consistency requirement, and what can be eventual?" Architecture is prioritizing the -ilities; naming them first is the highest-signal move.

Step 1 — Decompose by bounded context, not by noun

Draw services along bounded contexts (cohesive business capabilities), not data tables. State the boundary criteria: high cohesion inside, loose coupling across, independent deployability, a team that can own it. Call out the distributed-monolith anti-pattern explicitly: services that must deploy together, share a DB, or chat synchronously per request are a monolith with network latency added. (pa-01)

Step 2 — Define the contracts and the communication style

For each edge, choose sync (request/reply) vs async (events) and justify it:

sync (REST/gRPC):  need an immediate answer; caller can't proceed without it.
                   cost: temporal coupling, cascading failure, you own the latency budget.
async (events):    fire-and-forget / fan-out / decoupling / load-leveling / audit.
                   cost: eventual consistency, ordering, idempotency, harder to reason about.

Name the API contract (REST resource + idempotency keys; gRPC/proto with backward-compat rules; or an event schema in a registry) and how it versions without breaking consumers. (pa-02, pa-03)
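An idempotency key is easy to name and easy to get subtly wrong, so it helps to have the shape in mind. A minimal in-memory sketch, assuming a hypothetical Charges endpoint; a real service would persist the key and the response in the same transaction as the charge.

```go
package main

import (
	"fmt"
	"sync"
)

// Charges is a hypothetical payment endpoint that deduplicates requests
// by a client-supplied idempotency key.
type Charges struct {
	mu      sync.Mutex
	results map[string]string // idempotency key -> stored response
	applied int               // how many charges actually executed
}

func NewCharges() *Charges {
	return &Charges{results: make(map[string]string)}
}

// Charge executes the charge at most once per key. A retry with the same
// key replays the stored response instead of charging again.
func (c *Charges) Charge(key string, cents int) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if resp, ok := c.results[key]; ok {
		return resp
	}
	c.applied++
	resp := fmt.Sprintf("charge-%d:%d", c.applied, cents)
	c.results[key] = resp
	return resp
}

func main() {
	c := NewCharges()
	fmt.Println(c.Charge("key-1", 500)) // first attempt executes
	fmt.Println(c.Charge("key-1", 500)) // client retry: same response, no second charge
	fmt.Println("charges applied:", c.applied)
}
```

The contract point to state in the interview: the key is chosen by the caller, scoped to one logical operation, and the server promises the same response for the same key.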

Step 3 — Data: partitioning, consistency, and the dual-write trap

State the partitioning strategy (hash / range / consistent hashing) and why; the replication + consistency model (linearizable? causal? read-your-writes? eventual?) per data set, justified by the -ilities; and how you avoid the dual-write problem when you must update a DB and publish an event — the transactional outbox, not a best-effort publish. For cross-service workflows, a saga with compensations, not a distributed 2PC. (pa-04, pa-05, pa-06)
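The outbox is simpler than its name suggests. The sketch below is an in-memory illustration, not pa-05's code: Store, PlaceOrder, Relay, and errBroker are hypothetical names, and a mutex stands in for the database transaction that would cover both writes.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errBroker = errors.New("broker unavailable")

// Event is one row in the outbox table.
type Event struct {
	ID        int
	Payload   string
	Published bool
}

// Store stands in for one database. Orders and the outbox live in the same
// store, so a single "transaction" (the mutex here) covers both writes.
type Store struct {
	mu     sync.Mutex
	orders map[string]string
	outbox []Event
}

func NewStore() *Store { return &Store{orders: make(map[string]string)} }

// PlaceOrder writes the business row and the outbox row atomically.
// There is no separate publish step that could fail after the commit.
func (s *Store) PlaceOrder(id, status string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.orders[id] = status
	s.outbox = append(s.outbox, Event{ID: len(s.outbox) + 1, Payload: "order " + id + " " + status})
}

// Relay publishes every unpublished event and marks it published only after
// publish succeeds. A crash between publish and mark would re-send the event,
// so delivery is at-least-once and consumers must be idempotent.
func (s *Store) Relay(publish func(Event) error) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	for i := range s.outbox {
		if s.outbox[i].Published {
			continue
		}
		if err := publish(s.outbox[i]); err != nil {
			return err // stays unpublished; the next Relay retries it
		}
		s.outbox[i].Published = true
	}
	return nil
}

func main() {
	s := NewStore()
	s.PlaceOrder("o-1", "created")
	brokerDown := true
	var delivered []string
	publish := func(e Event) error {
		if brokerDown {
			brokerDown = false
			return errBroker
		}
		delivered = append(delivered, e.Payload)
		return nil
	}
	fmt.Println("first relay:", s.Relay(publish))
	fmt.Println("second relay:", s.Relay(publish), delivered)
}
```

The broker outage on the first relay loses nothing: the event is still in the outbox, and the second relay delivers it.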

Step 4 — Failure, reliability, and the SLO

Volunteer the failure modes and the budget: timeouts, retries with budgets + jitter, circuit breakers, bulkheads, backpressure, load shedding, graceful degradation. Frame reliability as an SLO + error budget (not "five nines everywhere") and say what degrades first. (pa-09, gw-06)
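"Retries with budgets + jitter" is worth being able to write on a whiteboard. The sketch below shows capped exponential backoff with full jitter and a bounded attempt count. It is an illustration, not gw-06's toolkit: backoff, retry, newRNG, noSleep, and errTransient are hypothetical names, and the per-call attempt cap is a simplification of a fleet-wide retry budget.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

var errTransient = errors.New("transient failure")

// newRNG returns a seeded generator so runs are reproducible.
func newRNG(seed int64) *rand.Rand { return rand.New(rand.NewSource(seed)) }

// noSleep is a stand-in for time.Sleep in tests.
func noSleep(time.Duration) {}

// backoff returns a "full jitter" delay: uniform in [0, ceiling), where
// ceiling is base doubled once per attempt and capped at limit.
// base and limit must be positive.
func backoff(base, limit time.Duration, attempt int, rng *rand.Rand) time.Duration {
	ceiling := base
	for i := 0; i < attempt && ceiling < limit; i++ {
		ceiling *= 2
	}
	if ceiling > limit {
		ceiling = limit
	}
	return time.Duration(rng.Int63n(int64(ceiling)))
}

// retry calls op up to maxAttempts times, sleeping a jittered backoff after
// each failure except the last. It returns the attempts used and the last error.
func retry(maxAttempts int, base, limit time.Duration, rng *rand.Rand,
	sleep func(time.Duration), op func() error) (int, error) {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(); err == nil {
			return attempt + 1, nil
		}
		if attempt < maxAttempts-1 {
			sleep(backoff(base, limit, attempt, rng))
		}
	}
	return maxAttempts, err
}

func main() {
	rng := newRNG(1)
	calls := 0
	attempts, err := retry(5, 10*time.Millisecond, time.Second, rng, time.Sleep, func() error {
		calls++
		if calls < 3 {
			return errTransient // fail twice, then succeed
		}
		return nil
	})
	fmt.Println("attempts:", attempts, "err:", err)
}
```

Jitter matters because clients that failed together would otherwise retry together; randomizing the delay spreads the retry wave instead of re-creating the spike.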

Step 5 — Ship it: IaC, GitOps, progressive delivery

"Infrastructure is declarative (Terraform/Pulumi), deployed via GitOps (git as source of truth, a reconciler that syncs + self-heals + prunes), with progressive delivery (canary/blue-green) gated on SLOs and automatic rollback." (pa-07, pa-08, gw-12)

Step 6 — Close with evolution

"I'd encode the key constraints as fitness functions (automated architecture tests — no dependency cycles, layering rules, contract compat in CI) so the design stays coherent as it grows, and capture the decision in an ADR." (pa-10)

The canonical questions, pre-solved

Question	The spine	Labs
"Design an event pipeline / streaming platform"	partitioned log → consumer groups → delivery semantics → outbox → DLQ	pa-04, pa-03, pa-05
"Design an internal developer platform / job system"	bounded-context services → contracts → async jobs via a log → IaC+GitOps → SLOs	pa-01, pa-04, pa-07, pa-08
"Decompose this monolith"	bounded contexts → strangler-fig → contracts → outbox for data → migrate by capability	pa-01, pa-05, gw-12
"Order service + payments, no lost/double charges"	saga + idempotency keys + outbox; never distributed 2PC	pa-05, pa-02
"Make this globally consistent / available"	CAP/PACELC; partition + quorum or eventual + causal; per-dataset choice	pa-06, db-16

2. The trade-off round (where architects are made)

They'll push on one decision. Win by showing the decision framework, not a memorized answer:

State the forces (the -ilities in tension).
Give 2–3 options with their costs, including the one you reject and why.
Recommend, tie it to the stated priorities, and name what would change your mind (the "we'd revisit if…").

Have crisp positions on: sync vs async (temporal coupling vs eventual consistency), strong vs eventual consistency (CAP/PACELC, per dataset), orchestration vs choreography (a central saga vs emergent events — visibility vs coupling), build vs buy, shared library vs shared service (deploy coupling vs network coupling), and monorepo vs polyrepo. Each lab's docs/analysis.md is a model of this "tradeoffs worth flagging" thinking.

3. The architecture-review round

Given a design, find: synchronous call chains that cascade (no bulkhead/breaker), dual writes, shared databases across services, missing idempotency, unbounded retries, ordering assumptions on a partitioned log, a single consistency model forced on everything, and no rollback path. Then propose an incremental migration (strangler-fig, outbox-then-cutover) — never a big-bang rewrite (gw-12).

4. Behavioral / leadership (the architect differentiators)

Prepare stories tagged to these dimensions; Apple's loop, like any architect loop, probes them:

Dimension	What they listen for	Story about…
Influence without authority	you aligned teams you don't manage	a cross-team standard/migration you drove
Owning a decision	you made the call with incomplete info and owned the outcome	a reversible call made fast; an irreversible one made carefully
A wrong decision	intellectual honesty + how you corrected	an architecture you had to walk back, and what you learned
Mentoring / leverage	you made others better, not just shipped	raising the bar via review or a paved road
Simplicity	you removed complexity, killed a service, said no	a design you made smaller
Data over opinion	you let a prototype/metric settle a debate	a contested decision resolved by evidence

5. The 30-60-90 (say this if asked)

0–30 — Map the terrain. Read the top services, their contracts, and the data/event flows; draw the real (not the wiki) architecture; meet the teams and learn their pain. Ship one small paved-road improvement to earn trust.
30–60 — Pick a high-leverage problem. Write the ADR for one cross-cutting decision (e.g., the eventing standard, the service-contract policy, the SLO framework); socialize it in design review; prototype it.
60–90 — Land it and templatize. Ship the pattern as a paved road + a fitness function in CI so it spreads without you; show a before/after metric (velocity, incident rate, or a killed service). Mentor an engineer to own the next slice.

Thread: reduce coupling, increase leverage, leave the architecture more evolvable and better-documented than you found it.

6. Questions to ask them (architect-grade)

"Where is the architecture today vs where you want it — and what's the biggest source of cross-team coupling or 'distributed monolith' pain?"
"How are service contracts governed and evolved — schema registry, compat checks in CI, or convention?"
"What's the eventing/streaming backbone, and how do you handle the dual-write problem and exactly-once expectations?"
"How do platform changes ship — IaC + GitOps + progressive delivery — and how mature is rollback/SLO-gating?"
"How does an architect actually drive a decision here — ADRs, an architecture council, RFCs? How is consensus reached and recorded?"
"What's the hardest distributed-systems trade-off the team is living with right now?" (Their answer is your future work.)

7. One-page cheat sheet

-ILITIES FIRST — name what you're optimizing + the scale, before drawing.
BOUNDED CONTEXTS — decompose by capability; beware the distributed monolith.
SYNC vs ASYNC — immediate answer vs decoupling; own the cost of each.
CONTRACTS — REST+idempotency / gRPC+proto-compat / event schema+registry; version safely.
DATA — partition (hash/range/consistent-hash) + consistency PER dataset (CAP/PACELC).
DUAL WRITE — transactional outbox, not best-effort publish. Cross-service = saga, not 2PC.
DELIVERY — at-least-once + idempotent consumers; ordering only within a partition; DLQ.
RELIABILITY — SLO + error budget; timeouts/retries+budget+jitter; breakers; bulkheads; shed.
SHIP — IaC (declarative) + GitOps (reconcile/self-heal) + progressive delivery (SLO-gated rollback).
EVOLUTION — ADRs for the why; fitness functions in CI; paved roads; build consensus, don't gatekeep.

Phase 7 — References

The architect's reading list, grouped by theme. Per-lab references.md go deeper.

Architecture & design (the core craft)

Sam Newman, Building Microservices (2nd ed.) — decomposition, contracts, integration, the distributed-monolith trap.
Eric Evans, Domain-Driven Design + Vaughn Vernon, Implementing DDD — bounded contexts, context maps (the basis for service boundaries).
Neal Ford et al., Building Evolutionary Architectures — fitness functions, architecture as a continuously-tested property.
Martin Fowler's bliki — MicroservicePremium, EventDrivenArchitecture, CQRS, EventSourcing, StranglerFigApplication, IntegrationDatabase. https://martinfowler.com/
Software Architecture: The Hard Parts (Ford/Richards) — distributed data, sagas, contract coupling, decision records.
Michael Nygard, Documenting Architecture Decisions (the ADR origin). https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions

Distributed systems fundamentals

Martin Kleppmann, Designing Data-Intensive Applications — the one book to read: replication, partitioning, consistency, stream processing, the dual-write problem. Maps to pa-04/05/06.
CAP (Brewer) and PACELC (Abadi) — the availability/consistency and latency/consistency trade-offs.
Werner Vogels, Eventually Consistent; Pat Helland, Life Beyond Distributed Transactions and Data on the Outside vs Data on the Inside.

Event-driven, streaming & messaging

Apache Kafka docs — the partitioned log, consumer groups, offsets, delivery semantics, exactly-once. The model pa-04 builds. https://kafka.apache.org/documentation/
NATS / JetStream, Pulsar, RabbitMQ docs — contrast the messaging models (subjects vs partitions vs queues vs streams).
Confluent — Transactional Outbox, Saga pattern, Schema Registry + Avro/Protobuf compatibility. https://www.confluent.io/blog/
microservices.io (Chris Richardson) — Saga, Outbox, CQRS, API Composition patterns. https://microservices.io/patterns/

Reliability & operations

Google SRE Book / Workbook — SLIs/SLOs/error budgets, multi-window multi-burn-rate alerting, addressing cascading failures. https://sre.google/workbook/alerting-on-slos/
Michael Nygard, Release It! — circuit breakers, bulkheads, timeouts, the stability patterns (pairs with gw-06).
Marc Brooker's blog — retries, jitter, metastable failures.

Platform, IaC, GitOps

Terraform / Pulumi docs — declarative resources, the dependency graph, plan/apply, state, drift. The model pa-07 builds.
ArgoCD / Flux docs — git as source of truth, reconcile, sync waves, self-heal, prune, progressive delivery. The model pa-08 builds.
Team Topologies (Skelton/Pais) — Conway's Law, platform teams, the paved road (why an architect's leverage is org-shaped).

Cross-phase links (already in this book)

db-16…20 — consensus, replication, linearizability (pa-06 builds on).
gw-08 / gw-10 — service mesh (xDS) and operators/CRDs (covered).
gw-06 / gw-11 — circuit breakers/adaptive concurrency and tracing/SLO primitives (pa-09 builds on).
gw-12 — progressive delivery / migration ladder (pa-08 builds on).

pa-01 — Microservices Decomposition & Service Contracts

The first question every platform architecture answers is where are the boundaries? The Apple JD asks for "software architecture and systems design, including microservices decomposition and service contracts." Get the boundaries right and services evolve independently; get them wrong and you build a distributed monolith — services that must deploy together, share a database, or call each other synchronously per request. That's a monolith with network latency, partial failure, and serialization overhead added: strictly worse.

This lab makes decomposition measurable. You build a service-dependency analyzer that detects the structural symptoms of bad boundaries — cycles, high coupling, layering violations — and computes blast radius (who breaks if a service breaks). These are the numbers an architect brings to a decomposition review.

1. What is it?

Decomposition is splitting a system into services along bounded contexts (Domain-Driven Design): cohesive business capabilities with a clear responsibility, owned by one team, independently deployable, owning their own data. The boundary test is high cohesion inside, loose coupling across.

A service contract is the explicit, versioned interface a service exposes — its API (REST/gRPC, pa-02), its events (pa-03), and the guarantees around them (idempotency, ordering, compatibility). Contracts are how services stay decoupled over time: a producer can change internals freely as long as the contract holds.

The dependency structure of a platform is a directed graph (A→B means "A depends on B"). Its shape tells you whether your decomposition is healthy:

HEALTHY (acyclic, layered)        UNHEALTHY (cyclic, "distributed monolith")
   web    mobile                     web ──▶ orders ◀──┐
     \    /                                   │        │
     orders                                   ▼        │
     /    \                              inventory ─────┘   (cycle: can't deploy alone)
 inventory payments                          │
     \    /                                   ▼
    postgres                              postgres ◀── orders (domain→infra: layering violation)

2. Why does it matter?

It's the decision that's most expensive to get wrong. Code inside a service is cheap to refactor; a boundary between services is a network API, a team boundary, and a deploy boundary all at once. Architects are hired to get these right because changing them later is a migration (gw-12).
The distributed monolith is the #1 failed-microservices outcome. Teams split by technical layer or by data table, end up with services that can't deploy independently, and inherit all the costs of distribution with none of the benefits. Recognizing and measuring the symptoms (cycles, coupling, shared DBs) is core architect judgment.
Blast radius is a design and operational tool. Knowing that "if auth fails, these 14 services break" drives where you put bulkheads, circuit breakers (gw-06, pa-09), and async boundaries (pa-03). It's also how you scope an incident.
Contracts are how a platform scales to many teams. With good contracts and compatibility rules (pa-02), 50 teams ship independently; without them, every change is a cross-team coordination meeting. The architect's leverage is making independent evolution safe.

3. How does it work?

Boundary heuristics (what to draw a line around)

Signal a boundary is right	Signal it's wrong
one team can own it end to end	two teams constantly change it
changes rarely ripple to other services	every change forces N other deploys
it owns its data; others access via its API	services share a database
cohesive single capability	a "utils"/"common" service everything depends on
can be deployed independently	must deploy in lockstep with others

The structural analyses (what the code computes)

Cycle detection (Cycles, Tarjan SCC). A dependency cycle means those services cannot evolve or deploy independently — the definitional distributed monolith. A healthy decomposition is a DAG.
Blast radius (BlastRadius). Reverse-reachability: every service that transitively depends on X. The set that breaks if X breaks.
Coupling (FanIn/FanOut). High fan-in = a critical dependency (large blast radius; protect it). High fan-out = a fragile service (a failure in any of its many dependencies can break it; it's doing too much or is too chatty).
Layering rules (LayeringViolations). Architectural constraints like "domain logic must not depend on infrastructure" or "no service may depend on a higher layer," enforced over the graph. These become automated fitness functions in CI (pa-10).

Integration styles (how services should — and shouldn't — couple)

GOOD: API call (sync, pa-02) or event (async, pa-03) across a contract.
GOOD: each service owns its data; others go through its API.
BAD:  shared database (an integration database) — couples schemas + deploys.
BAD:  synchronous chains N deep — latency + cascading failure (pa-09).

Evolving boundaries: the strangler fig

You rarely get boundaries right up front. The strangler-fig pattern migrates incrementally: route a slice of traffic/functionality to the new service, grow it, and retire the old path — never a big-bang rewrite (gw-12). Contracts + the outbox (pa-05) make the data migration safe.
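"Route a slice of traffic" can be as small as a stable hash and a percentage. The sketch below is one way to do it, under assumptions: bucket and route are hypothetical names, the routing key is some stable identifier such as a customer ID, and gw-12's rollout engine may work differently.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bucket maps a routing key to a stable bucket in [0, 100).
func bucket(key string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32() % 100
}

// route sends a key to the new service when its bucket falls below the
// rollout percentage. The bucket is stable, so a key that has moved to the
// new service stays there as the percentage grows.
func route(key string, percentNew uint32) string {
	if bucket(key) < percentNew {
		return "new"
	}
	return "old"
}

func main() {
	for _, pct := range []uint32{0, 10, 50, 100} {
		toNew := 0
		for i := 0; i < 1000; i++ {
			if route(fmt.Sprintf("customer-%d", i), pct) == "new" {
				toNew++
			}
		}
		fmt.Printf("%3d%% rollout: %d/1000 keys on the new service\n", pct, toNew)
	}
}
```

Hashing the key rather than rolling a random number per request keeps each customer on one side of the cut, which makes the slice debuggable and the rollback clean.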

4. Core terminology

Term	Definition
Bounded context	A cohesive business capability with its own model and boundary (DDD).
Service contract	The explicit, versioned interface (API + events + guarantees) a service exposes.
Distributed monolith	Services that must deploy together / share data / call synchronously per request.
Cohesion / coupling	How related a service's responsibilities are / how dependent services are on each other.
Fan-in / fan-out	Number of services depending on X / number X depends on.
Blast radius	The set of services affected if a given service fails.
Cycle (SCC)	A group of services mutually (transitively) dependent — cannot deploy independently.
Integration database	An anti-pattern: multiple services sharing one database.
Layering rule	An architectural constraint on which layers may depend on which.
Strangler fig	Incrementally replacing a system by routing slices to the new one.
Conway's Law	System structure mirrors org communication structure; boundaries are socio-technical.

5. Mental models

Services are organs, not Lego bricks. A good boundary has high cohesion (an organ does one job) and a thin, well-defined interface (contracts = the bloodstream). Splitting by technical layer is like separating "all the left halves" from "all the right halves" — maximal coupling across the cut.
The dependency graph is an X-ray. You can't see coupling in a wiki diagram, but cycles, fan-in hotspots, and layering violations are visible in the graph. The analyzer is the X-ray machine; the architect reads the film.
Conway's Law is gravity. Your architecture will mirror your org chart. If two teams own one service, it'll fracture along the team line; if one service needs two teams to change it, that's a boundary in the wrong place. Design the boundaries and the team topology together.
A contract is a promise you can keep while changing your mind. It lets you rewrite a service's internals freely. The moment consumers depend on something not in the contract (a DB table, a response field's incidental order), you've lost the freedom and gained a distributed monolith.

6. Common misconceptions

"Microservices = good, monolith = bad." A well-structured monolith beats a distributed monolith every time. Microservices buy independent deployability and scaling at the cost of distribution complexity; pay it only when you need what it buys (Fowler's "microservice premium"). Many systems should start as a modular monolith.
"Split by technical layer (UI / logic / data services)." That maximizes cross-cutting coupling: every feature touches every layer, so every change is a multi-service deploy. Split by capability (orders, inventory, payments), each owning its full stack.
"Shared database is fine, it's just one team's data." An integration database couples schemas and deploys across services and destroys independent evolution. Each service owns its data; others use its API or its events (pa-05 outbox).
"Smaller services are always better." Nano-services explode the number of network hops, contracts, and failure modes. Size to a bounded context a team can own, not to "one function per service."
"We'll fix the boundaries later." Boundaries are the hardest thing to change later (they're migrations). Spend architect time here up front, and design for evolution (strangler fig) when you must move one.

7. Interview talking points

"How do you decide service boundaries?" Bounded contexts / capabilities, not technical layers or data tables. Test: high cohesion inside, loose coupling across, independently deployable, one team owns it, owns its data. Name the distributed-monolith anti-pattern and how you'd detect it (cycles, shared DBs, per-request sync chains).
"How do you know a decomposition has gone wrong?" Measurable symptoms: dependency cycles, a service with huge fan-in that everything waits on, services that always deploy together, shared databases, synchronous call chains. I'd put cycle/layering checks in CI as fitness functions (pa-10).
"What's blast radius and why compute it?" The set of services that break if one fails (transitive dependents). It tells you where to put bulkheads/breakers and async boundaries, and it scopes incidents. Reducing blast radius is a primary goal of decomposition.
"How do services stay decoupled as they evolve?" Explicit, versioned contracts (pa-02) with compatibility rules, owning their own data, and async events (pa-03) where temporal coupling is unacceptable. The contract lets a producer change internals without breaking consumers.
"Monolith → microservices: how?" Strangler fig: carve one bounded context at a time behind a contract, move its data with an outbox (pa-05), route a slice of traffic, verify, retire the old path. Never a big-bang rewrite (gw-12). Stop when the remaining monolith is fine — you don't have to split everything.
"When would you NOT use microservices?" Early-stage / small team / unclear domain boundaries: a modular monolith ships faster and lets you learn the real boundaries before paying the distribution tax.

8. Connections to other labs

pa-02 (API design) — contracts are how the boundaries you draw here stay decoupled over time.
pa-03 / pa-04 (events / log) — async communication breaks the temporal coupling that turns sync call-chains into distributed monoliths.
pa-05 (outbox) — how a service owns its data and publishes events without a shared DB or a dual write.
pa-10 (architecture in practice) — the cycle/layering checks here become automated fitness functions in CI; the servicegraph is reused.
gw-12 (migration) — strangler-fig boundary moves are migrations, run with the rollout ladder.
db-/gw- — the services you decompose are the storage engines, gateways, and consensus systems built earlier.

pa-01 — The Hitchhiker's Guide to Decomposition & Contracts

Companion to CONCEPTS.md, with the runnable analyzer in src/go/servicegraph/. Boundaries are the most expensive thing to get wrong; this lab makes the symptoms measurable.

Run bash scripts/verify.sh and the sgsim demo prints the X-ray of a slightly-unhealthy platform:

cycles (distributed-monolith smell):
  [inventory orders]  <- cannot deploy/evolve independently
blast radius (who breaks if X breaks):
  postgres   [inventory mobile orders payments web]
  orders     [inventory mobile web]
layering violations (domain must not depend on infra):
  orders (domain) -> postgres (infra)

That output is what an architect brings to a decomposition review: not opinions, evidence.

1. The dependency graph as an X-ray (graph.go)

A platform's health is visible in its dependency graph (A→B = "A depends on B"). Graph stores the edges and tags each service with a layer. The four analyses are the architect's instruments:

Cycles = the distributed-monolith detector

Cycles() runs Tarjan's strongly-connected-components algorithm and returns any SCC of size > 1 (plus self-loops). A cycle means those services are mutually dependent — they cannot deploy or evolve independently, which is the definition of a distributed monolith. TestCycleDetection builds a→b→c→a and confirms it's found; TestAcyclicHasNoCycle confirms a clean DAG is silent. A healthy decomposition is acyclic. (Tarjan is the same SCC machinery you'd use to find dependency cycles in a build graph or an import graph — worth knowing cold.)
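For reference, here is a compact Tarjan sketch. It is not the lab's graph.go: the Graph layout and the AddDependency and Cycles signatures are assumptions that borrow the names used in this guide, and hasSelfLoop is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
)

// Graph is a stand-in for the lab's type: edges[a] lists what a depends on.
type Graph struct {
	edges map[string][]string
}

func NewGraph() *Graph { return &Graph{edges: make(map[string][]string)} }

// AddDependency records "from depends on to" and registers both services.
func (g *Graph) AddDependency(from, to string) {
	g.edges[from] = append(g.edges[from], to)
	if _, ok := g.edges[to]; !ok {
		g.edges[to] = nil
	}
}

func (g *Graph) hasSelfLoop(v string) bool {
	for _, w := range g.edges[v] {
		if w == v {
			return true
		}
	}
	return false
}

// Cycles returns every strongly connected component with more than one
// member, plus single services that depend on themselves. Output is sorted.
func (g *Graph) Cycles() [][]string {
	next := 0
	index := map[string]int{} // discovery order
	low := map[string]int{}   // lowest index reachable via the DFS stack
	onStack := map[string]bool{}
	var stack []string
	var out [][]string

	var visit func(v string)
	visit = func(v string) {
		index[v], low[v] = next, next
		next++
		stack = append(stack, v)
		onStack[v] = true
		for _, w := range g.edges[v] {
			if _, seen := index[w]; !seen {
				visit(w)
				if low[w] < low[v] {
					low[v] = low[w]
				}
			} else if onStack[w] && index[w] < low[v] {
				low[v] = index[w]
			}
		}
		if low[v] != index[v] {
			return // v is not the root of a component
		}
		var scc []string
		for { // pop the component rooted at v
			w := stack[len(stack)-1]
			stack = stack[:len(stack)-1]
			onStack[w] = false
			scc = append(scc, w)
			if w == v {
				break
			}
		}
		if len(scc) > 1 || g.hasSelfLoop(v) {
			sort.Strings(scc)
			out = append(out, scc)
		}
	}

	nodes := make([]string, 0, len(g.edges))
	for n := range g.edges {
		nodes = append(nodes, n)
	}
	sort.Strings(nodes) // deterministic traversal order
	for _, n := range nodes {
		if _, seen := index[n]; !seen {
			visit(n)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i][0] < out[j][0] })
	return out
}

func main() {
	g := NewGraph()
	g.AddDependency("a", "b")
	g.AddDependency("b", "c")
	g.AddDependency("c", "a")
	g.AddDependency("d", "a") // d depends on the cycle but is not in it
	fmt.Println(g.Cycles())   // prints [[a b c]]
}
```

Note that d is not reported: it depends on the cycle but nothing in the cycle depends on it, so it can still deploy on its own schedule.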

Blast radius = reverse reachability

BlastRadius(svc) walks the edges backwards (transitive dependents): everything that breaks if svc breaks. TestBlastRadius shows that killing auth takes down api, web, and mobile, while a leaf has an empty radius. This is a design tool (where to put bulkheads/breakers, gw-06/pa-09, and async boundaries, pa-03) and an incident tool (scope the damage).
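Reverse reachability is a breadth-first walk over flipped edges. A minimal sketch, assuming the graph is a plain map from each service to its direct dependencies; the free function blastRadius is a hypothetical stand-in for the lab's method, and the edges mirror the auth example above.

```go
package main

import (
	"fmt"
	"sort"
)

// blastRadius returns every service that transitively depends on svc,
// sorted, excluding svc itself. deps[a] lists what a depends on.
func blastRadius(deps map[string][]string, svc string) []string {
	// Flip the edges: rev[b] lists the services that directly depend on b.
	rev := map[string][]string{}
	for from, tos := range deps {
		for _, to := range tos {
			rev[to] = append(rev[to], from)
		}
	}
	seen := map[string]bool{svc: true} // marking svc keeps it out of its own radius
	queue := []string{svc}
	var out []string
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for _, dependent := range rev[cur] {
			if !seen[dependent] {
				seen[dependent] = true
				out = append(out, dependent)
				queue = append(queue, dependent)
			}
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	deps := map[string][]string{
		"web":    {"api"},
		"mobile": {"api"},
		"api":    {"auth"},
	}
	fmt.Println(blastRadius(deps, "auth")) // prints [api mobile web]
	fmt.Println(blastRadius(deps, "web"))  // prints []: nothing depends on a leaf
}
```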

Coupling = fan-in / fan-out

FanIn (who depends on me) and FanOut (who I depend on) are afferent and efferent coupling. High fan-in = a critical chokepoint (large blast radius: protect it, make it rock-solid). High fan-out = a fragile service that a failure in any of its many dependencies can break (too chatty, or doing too much). The demo shows orders with fanIn=3, fanOut=3: a hub worth scrutinizing.
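Both metrics are direct-edge counts. In the sketch below, fanIn and fanOut are hypothetical free functions, and the edge list is a guess that is consistent with the demo output shown earlier, not the lab's actual fixture.

```go
package main

import "fmt"

// fanOut counts the distinct services svc depends on directly.
func fanOut(deps map[string][]string, svc string) int {
	seen := map[string]bool{}
	for _, to := range deps[svc] {
		seen[to] = true
	}
	return len(seen)
}

// fanIn counts the distinct services that depend on svc directly.
func fanIn(deps map[string][]string, svc string) int {
	n := 0
	for _, tos := range deps {
		for _, to := range tos {
			if to == svc {
				n++
				break // count each dependent service once
			}
		}
	}
	return n
}

func main() {
	// Assumed edges, chosen to match the demo's cycle and blast radii.
	deps := map[string][]string{
		"web":       {"orders"},
		"mobile":    {"orders"},
		"orders":    {"inventory", "payments", "postgres"},
		"inventory": {"orders"},
		"payments":  {"postgres"},
	}
	for _, svc := range []string{"orders", "postgres", "web"} {
		fmt.Printf("%-9s fanIn=%d fanOut=%d\n", svc, fanIn(deps, svc), fanOut(deps, svc))
	}
}
```

Read the two numbers together: orders is both widely depended on and widely dependent, so a fault anywhere near it travels in both directions.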

Layering rules = architecture constraints

LayeringViolations(rules) flags edges that break a rule like "domain must not depend on infra." TestLayeringViolations catches order-domain → postgres-adapter. These constraints are exactly what an architect encodes as a fitness function in CI (pa-10) so the rule is enforced automatically forever, not policed by hand in reviews.
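A layering check is a filter over edges. The sketch below assumes a rule is a forbidden (from-layer, to-layer) pair; the Rule type and the layeringViolations function are hypothetical, and the lab's own rule representation may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// Rule forbids services in layer From from depending on layer To.
type Rule struct {
	From, To string
}

// layeringViolations returns one sorted, human-readable line per edge that
// breaks a rule. deps[a] lists what a depends on; layer tags each service.
func layeringViolations(deps map[string][]string, layer map[string]string, rules []Rule) []string {
	var out []string
	for from, tos := range deps {
		for _, to := range tos {
			for _, r := range rules {
				if layer[from] == r.From && layer[to] == r.To {
					out = append(out, fmt.Sprintf("%s (%s) -> %s (%s)", from, r.From, to, r.To))
				}
			}
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	deps := map[string][]string{
		"order-domain":     {"postgres-adapter", "pricing-domain"},
		"postgres-adapter": {"order-domain"}, // infra -> domain is allowed
	}
	layer := map[string]string{
		"order-domain":     "domain",
		"pricing-domain":   "domain",
		"postgres-adapter": "infra",
	}
	rules := []Rule{{From: "domain", To: "infra"}}
	for _, v := range layeringViolations(deps, layer, rules) {
		fmt.Println(v)
	}
}
```

Wrapped in a test that fails when the result is non-empty, this is already a fitness function.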

2. Why this is the architect's first move

You cannot architect what you cannot see. A wiki diagram lies; the real graph (derived from code imports, call traces, or a service registry) doesn't. Feeding that real graph through these four analyses turns "I think our boundaries are getting muddy" into "we have 3 cycles, a fan-in-of-40 chokepoint, and 7 domain→infra violations — here's the plan." That evidence is how you build consensus (pa-10) for a decomposition change.

3. From analysis to action

The graph tells you what's wrong; the patterns tell you how to fix it:

A cycle → break it with an async event (pa-03) so the back-edge becomes "publish and forget" instead of a synchronous dependency, or extract the shared concept into its own service both depend on.
A huge fan-in chokepoint → it's a critical dependency; harden it (SLOs, bulkheads — pa-09) and consider whether callers really need it synchronously or could consume events.
A layering violation → invert the dependency (depend on an interface the domain owns; the infra adapter implements it) and add a fitness function so it can't regress.
A shared database → give each service its own store; integrate via API or the outbox (pa-05).
Moving a boundary → strangler fig + the migration ladder (gw-12).
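The layering fix in the list above, inverting the dependency, looks like this in Go. It is a single-file sketch with hypothetical names (OrderRepo, OrderService, memoryRepo); in a real codebase the two halves would live in separate packages, with the infra package importing the domain package and never the reverse.

```go
package main

import (
	"errors"
	"fmt"
)

// --- domain layer: owns the interface, knows nothing about storage ---

// OrderRepo is the port the domain needs. The domain defines it.
type OrderRepo interface {
	Save(id string) error
}

// OrderService holds the business rule and depends only on the port.
type OrderService struct {
	repo OrderRepo
}

func (s OrderService) Place(id string) error {
	if id == "" {
		return errors.New("empty order id")
	}
	return s.repo.Save(id)
}

// --- infra layer: implements the domain's interface ---

// memoryRepo stands in for a Postgres adapter. It satisfies OrderRepo, so
// the dependency arrow points from infra to domain.
type memoryRepo struct {
	saved []string
}

func (m *memoryRepo) Save(id string) error {
	m.saved = append(m.saved, id)
	return nil
}

func main() {
	repo := &memoryRepo{}
	svc := OrderService{repo: repo} // wiring happens at the edge of the program
	fmt.Println("place o-1:", svc.Place("o-1"))
	fmt.Println("saved:", repo.saved)
}
```

After the inversion the graph edge reads infra -> domain, which the "domain must not depend on infra" rule allows.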

4. Hands-on

cd src/go
bash ../scripts/verify.sh
go run ./cmd/sgsim

Then point it at your system: emit your services' import/call graph as AddDependency edges (many languages can dump this), tag layers, and run the analyses. The cycles and chokepoints you find are your architecture backlog.

5. Exercises

Feed a real graph: generate edges from a codebase's import graph (or a distributed-trace service map) and find the real cycles + chokepoints.
Break a cycle: model converting one back-edge to an async event (remove the edge) and show the cycle disappears — the structural case for pa-03.
Add topoOrder(): for an acyclic graph, return a deployment/build order (topological sort); error if a cycle exists. (This is also how pa-07's IaC engine orders apply.)
Coupling budget: add a fitness function "no service may have fan-in > N or be in a cycle"; wire it as a failing test (pa-10).
Weighted blast radius: weight edges by call volume and rank services by traffic-weighted blast radius to prioritize hardening.

pa-01 — References

Decomposition & boundaries

Sam Newman, Building Microservices (2nd ed.) — ch. on decomposition, the distributed-monolith trap, integration styles.
Eric Evans, Domain-Driven Design; Vaughn Vernon, Implementing DDD — bounded contexts and context maps (the basis for boundaries).
Martin Fowler — MicroservicePremium, IntegrationDatabase, StranglerFigApplication, PresentationDomainDataLayering. https://martinfowler.com/
Team Topologies (Skelton/Pais) — Conway's Law, team-sized services, the socio-technical view of boundaries.
Software Architecture: The Hard Parts (Ford/Richards) — component coupling, the "is it a service?" decision.

Algorithms used

Tarjan's strongly-connected-components algorithm (cycle detection).
Topological sort (deployment/build ordering on a DAG — exercise §5.3, reused in pa-07).

Tools that do this for real

Dependency-cycle linters (e.g. Go's import-cycle errors, go mod graph, ArchUnit for the JVM, import-linter for Python).
Service maps from distributed tracing (gw-11) and service meshes (gw-08).
Backstage / service catalogs for ownership + dependency metadata.

Cross-lab links

pa-02 (contracts), pa-03/04 (async to break coupling), pa-05 (outbox vs shared DB), pa-10 (cycle/layering checks as fitness functions), gw-12 (strangler-fig moves as migrations).

pa-01 — Analysis

A design-review treatment of the service-graph analyzer and the decomposition decisions it informs.

What the analyzer must get right

Sound cycle detection. Report exactly the strongly-connected components of size > 1 (plus self-loops); a clean DAG reports nothing.
Correct blast radius. Transitive dependents (reverse edges), not dependencies; exclude the service itself; deterministic (sorted).
Honest coupling metrics. Fan-in/out are direct-edge counts; blast-radius is the transitive version of fan-in.
Enforceable layering rules. A rule is a (from-layer, to-layer) prohibition checked over every edge.
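As a rough illustration of the first requirement, here is a minimal Tarjan sketch. The Graph type and the Cycles function are assumptions made for this example and may not match servicegraph/graph.go. It reports components of size > 1 plus self-loops, and iterates in sorted order so the report is deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// Graph maps a service to the services it depends on (illustrative shape).
type Graph map[string][]string

// Cycles returns the strongly connected components that form cycles:
// components with more than one node, plus single nodes with a self-loop.
func Cycles(g Graph) [][]string {
	index := map[string]int{} // discovery order of each visited node
	low := map[string]int{}   // lowest discovery index reachable from the node
	onStack := map[string]bool{}
	var stack []string
	var out [][]string
	next := 0

	var visit func(v string)
	visit = func(v string) {
		index[v], low[v] = next, next
		next++
		stack = append(stack, v)
		onStack[v] = true
		for _, w := range g[v] {
			if _, seen := index[w]; !seen {
				visit(w)
				if low[w] < low[v] {
					low[v] = low[w]
				}
			} else if onStack[w] && index[w] < low[v] {
				low[v] = index[w]
			}
		}
		if low[v] != index[v] {
			return // v is not the root of a component
		}
		// v is a component root: pop the component off the stack.
		var comp []string
		for {
			w := stack[len(stack)-1]
			stack = stack[:len(stack)-1]
			onStack[w] = false
			comp = append(comp, w)
			if w == v {
				break
			}
		}
		selfLoop := false
		for _, w := range g[v] {
			if w == v {
				selfLoop = true
			}
		}
		if len(comp) > 1 || selfLoop {
			sort.Strings(comp)
			out = append(out, comp)
		}
	}

	// Sorted iteration keeps the report deterministic.
	nodes := make([]string, 0, len(g))
	for v := range g {
		nodes = append(nodes, v)
	}
	sort.Strings(nodes)
	for _, v := range nodes {
		if _, seen := index[v]; !seen {
			visit(v)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i][0] < out[j][0] })
	return out
}

func main() {
	g := Graph{
		"orders":   {"payments"},
		"payments": {"ledger"},
		"ledger":   {"orders"}, // closes a 3-node cycle
		"search":   {"search"}, // self-loop
		"web":      {"orders"},
	}
	fmt.Println(Cycles(g))
}
```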

Design decisions

Edge direction = "depends on." A→B means A needs B, so failure propagates against the edges (B's failure breaks A). Blast radius therefore walks edges in reverse. The tool's correctness hinges on getting this direction right.
Tarjan over naive DFS-coloring. Tarjan returns the actual SCCs (the cyclic groups), which is what you report to engineers — not just a boolean "has cycle." Deterministic via sorted iteration.
Layers as free-form tags. Rather than hard-coding a layer model, a service carries a tag and rules are data, so the same engine enforces "domain↛infra," "no upward deps," or team-ownership rules. This is the hook for pa-10 fitness functions.
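The edge-direction and layers-as-data decisions above can be sketched as follows. Edge, Rule, BlastRadius, and Violations are hypothetical names chosen for this example; the lab's actual API may differ. Blast radius is a breadth-first walk over reversed edges, and a layering rule is plain data checked against every edge.

```go
package main

import (
	"fmt"
	"sort"
)

// Edge means From depends on To.
type Edge struct{ From, To string }

// BlastRadius returns every service that transitively depends on target,
// found by walking the edges in reverse. The target itself is excluded.
func BlastRadius(edges []Edge, target string) []string {
	dependents := map[string][]string{} // To -> services that depend on it
	for _, e := range edges {
		dependents[e.To] = append(dependents[e.To], e.From)
	}
	seen := map[string]bool{target: true}
	queue := []string{target}
	var out []string
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for _, d := range dependents[cur] {
			if !seen[d] {
				seen[d] = true
				out = append(out, d)
				queue = append(queue, d)
			}
		}
	}
	sort.Strings(out) // deterministic report
	return out
}

// Rule forbids any edge from a service tagged FromLayer to one tagged ToLayer.
type Rule struct{ FromLayer, ToLayer string }

// Violations checks every edge against every rule.
func Violations(edges []Edge, layer map[string]string, rules []Rule) []Edge {
	var bad []Edge
	for _, e := range edges {
		for _, r := range rules {
			if layer[e.From] == r.FromLayer && layer[e.To] == r.ToLayer {
				bad = append(bad, e)
				break
			}
		}
	}
	return bad
}

func main() {
	edges := []Edge{
		{"web", "orders"},
		{"orders", "payments"},
		{"orders", "postgres"},
		{"reports", "postgres"},
	}
	fmt.Println("blast radius of postgres:", BlastRadius(edges, "postgres"))
	layer := map[string]string{"orders": "domain", "postgres": "infra"}
	fmt.Println("violations:", Violations(edges, layer, []Rule{{"domain", "infra"}}))
}
```

Because rules are data, swapping the rule slice is enough to enforce a different policy (for example, team-ownership tags) with the same loop.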

Tradeoffs worth flagging

The graph is only as true as its source. Edges from a wiki are fiction; derive them from import graphs, call traces, or a service registry. Garbage in, confident-but-wrong out.
Cycles aren't always fixable by splitting. Sometimes two services genuinely co-evolve and should be merged back into one. The tool flags the smell; the architect decides split vs merge vs async-break.
Low coupling vs too many services. Pushing fan-out toward zero by splitting ever further creates nano-services (more hops, more contracts). The metric guides; it doesn't dictate.
Static structure misses temporal coupling. Two services with no edge can still be coupled if they must change together for a feature. Pair the graph with change-coupling analysis (files/services that commit together).

What production adds beyond this lab

Auto-derived graphs from tracing/mesh + git history (change coupling).
Weighted edges (call volume, latency) for traffic-weighted blast radius.
Contract-compatibility gates (pa-02) wired into the same CI as the cycle/layering fitness functions (pa-10).
Ownership/team metadata (Conway's-Law analysis: do boundaries match teams?).

pa-01 — Execution

Prerequisites

Go ≥ 1.25 (stdlib only, offline).

One-shot

cd pa-01-microservices-decomposition && bash scripts/verify.sh

Per-language workflow (Go)

cd pa-01-microservices-decomposition/src/go
go test -race -count=1 ./...      # cycle, blast-radius, fan-in/out, layering
go run ./cmd/sgsim                # the X-ray of a sample platform

Package map

File	What
`servicegraph/graph.go`	dependency graph; Tarjan cycle detection; blast radius; fan-in/out; layering rules
`cmd/sgsim`	builds a sample graph and prints cycles, blast radius, coupling, violations

Use it on your system

Emit your services' dependency edges (AddDependency(from, to)) from an import graph / trace service-map, SetLayer each service, then run the analyses. The cycles and chokepoints are your architecture backlog. See GUIDE.md §3 for what each finding implies.

pa-01 — Verification

One command

cd pa-01-microservices-decomposition && bash scripts/verify.sh

What the tests prove

Test	Invariant
`TestCycleDetection`	a 3-node dependency cycle is found (and reported as the SCC)
`TestSelfLoopIsCycle`	a self-dependency counts as a cycle
`TestAcyclicHasNoCycle`	a clean DAG reports no cycles
`TestBlastRadius`	transitive dependents are correct; a leaf has an empty radius
`TestFanInOut`	afferent/efferent coupling counts are correct
`TestLayeringViolations`	a domain→infra edge is flagged against the rule

All under -race.

What "green" does NOT guarantee

The graph's truth is your responsibility. The analyzer is sound; garbage edges give confident-but-wrong findings (derive edges from real imports/traces).
Structure, not temporal coupling. Services that must change together but share no edge aren't caught (add change-coupling analysis).
No contract checking here. API/event compatibility is pa-02; these checks become CI fitness functions in pa-10.

pa-02 — API Design Across REST, gRPC, and Events

Once you've drawn service boundaries (pa-01), the contracts across those boundaries are what keep services decoupled over time. The Apple JD asks for "strong API design abilities across REST, gRPC, and event-driven interfaces." The architect's job here is less "design a pretty endpoint" and more "design interfaces that evolve without breaking the dozens of consumers you'll never meet."

This lab builds the mechanics behind durable contracts: a compatibility checker (the rules a tool like buf breaking enforces), an idempotency store (safe retries of non-idempotent operations), and opaque pagination cursors (change the implementation without breaking the contract).

1. What is it?

An API contract is the explicit, versioned interface a service exposes plus its guarantees. Three styles, same goal (let producer and consumer evolve independently):

REST — resources + verbs over HTTP; contract = the resource shapes, status codes, idempotency semantics, pagination, and versioning.
gRPC — typed RPC over HTTP/2 (gw-02) with protobuf schemas; contract = the .proto and its backward/forward-compatibility rules.
Events — async messages (pa-03/04); contract = the event schema in a schema registry with compatibility rules, plus delivery guarantees (ordering, at-least-once).

Compatibility is the property that makes a contract a contract: backward (new consumers read old producers' data) and forward (old consumers read new producers' data). Protobuf's tag-numbered fields make this tractable; the checker here encodes the rules.

2. Why does it matter?

It's what makes independent deployability real. Boundaries (pa-01) only decouple if a producer can change without a lockstep consumer deploy. That requires compatible contract evolution — and automated enforcement, or it silently rots.
The dangerous changes are invisible without rules. Reusing a protobuf tag, changing a field's type, or tightening optional→required doesn't fail to compile — it corrupts data or breaks consumers at runtime, in production, far from the change. A compatibility gate in CI is the architect's seatbelt.
Idempotency is non-negotiable at scale. Networks retry; queues deliver at-least-once (pa-03/05); gateways retry (gw-06). Without idempotency keys, "create order" and "charge card" double-apply. Designing idempotency into the contract is core API design.
Opaque contracts preserve freedom. An opaque cursor lets you change pagination from offset to keyset without breaking clients; a leaked internal id or an offset-based contract freezes your implementation. Good API design hides what should be free to change.

3. How does it work?

Contract compatibility (the rules in code)

Protobuf identifies fields by a stable tag number, not position or name — that's what lets schemas evolve. The checker (Check(old, new)) applies the rules the ecosystem enforces:

Change	Verdict	Why
add an optional field	safe	old data lacks it → default; old readers ignore it
add a required field	breaking	old messages can't satisfy it
remove a field	warning	reserve the tag so it's never reused
change a field's type (same tag)	breaking	wire-incompatible; data misread
rename (same tag/type)	warning	wire-OK; source-level change
optional → required	breaking	old producers may omit it
required → optional	safe	loosening

HasBreaking is the CI gate: block a merge that breaks the wire.
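The rule table above can be sketched as a tag-keyed diff. The Schema, Field, and Finding types here are simplified assumptions for illustration, and the parameters are named prev/next to avoid shadowing Go's built-in new; the lab's Check(old, new) may be shaped differently.

```go
package main

import (
	"fmt"
	"sort"
)

type Severity string

const (
	Safe     Severity = "SAFE"
	Warning  Severity = "WARNING"
	Breaking Severity = "BREAKING"
)

// Field is a simplified, hypothetical view of one schema field.
type Field struct {
	Name     string
	Type     string
	Required bool
}

// Schema maps a tag number to its field, mirroring how protobuf matches fields.
type Schema map[int]Field

type Finding struct {
	Tag      int
	Severity Severity
	Message  string
}

// Check diffs two schemas by tag number and classifies each change.
func Check(prev, next Schema) []Finding {
	var out []Finding
	add := func(tag int, s Severity, msg string) {
		out = append(out, Finding{Tag: tag, Severity: s, Message: msg})
	}
	for tag, o := range prev {
		n, ok := next[tag]
		if !ok {
			add(tag, Warning, fmt.Sprintf("field %q removed; reserve the tag", o.Name))
			continue
		}
		if n.Type != o.Type {
			add(tag, Breaking, fmt.Sprintf("field %q changed type %s -> %s", o.Name, o.Type, n.Type))
		}
		if !o.Required && n.Required {
			add(tag, Breaking, fmt.Sprintf("field %q became required", n.Name))
		}
		if o.Required && !n.Required {
			add(tag, Safe, fmt.Sprintf("field %q became optional", n.Name))
		}
		if n.Name != o.Name {
			add(tag, Warning, fmt.Sprintf("field %q renamed to %q", o.Name, n.Name))
		}
	}
	for tag, n := range next {
		if _, existed := prev[tag]; existed {
			continue
		}
		if n.Required {
			add(tag, Breaking, fmt.Sprintf("required field %q added", n.Name))
		} else {
			add(tag, Safe, fmt.Sprintf("optional field %q added", n.Name))
		}
	}
	// Map iteration order is random; sort by tag for a stable report.
	sort.SliceStable(out, func(i, j int) bool { return out[i].Tag < out[j].Tag })
	return out
}

// HasBreaking is the CI gate: true if any finding is wire-breaking.
func HasBreaking(findings []Finding) bool {
	for _, f := range findings {
		if f.Severity == Breaking {
			return true
		}
	}
	return false
}

func main() {
	prev := Schema{1: {"id", "string", true}, 2: {"email", "string", false}}
	next := Schema{1: {"id", "string", true}, 2: {"email", "string", true}, 3: {"nickname", "string", false}}
	findings := Check(prev, next)
	for _, f := range findings {
		fmt.Printf("[%-8s] tag %d: %s\n", f.Severity, f.Tag, f.Message)
	}
	fmt.Println("=> CI gate: breaking =", HasBreaking(findings))
}
```

Matching by tag rather than by name is what lets the diff tell a rename apart from a remove-plus-add.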

Idempotency (safe retries)

An idempotency key (client-supplied, unique per logical operation) lets the server dedupe retries: first call executes and caches the result; retries return the cached result without re-executing (IdempotencyStore.Do). Failures aren't cached, so genuine retries of a failed op still run. This is how non-idempotent operations (POST /orders, charge) become safe under at-least-once delivery and client retries.
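A minimal in-memory sketch of that behavior follows. The string result type, the constructor, and ErrTransient are assumptions for the example; key expiry (TTL) is omitted, and holding one lock across fn serializes all keys, which a production store would not do.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrTransient is a hypothetical failure used by the demo below.
var ErrTransient = errors.New("transient failure")

// IdempotencyStore remembers the result of each successful operation by key.
type IdempotencyStore struct {
	mu      sync.Mutex
	results map[string]string
}

func NewIdempotencyStore() *IdempotencyStore {
	return &IdempotencyStore{results: map[string]string{}}
}

// Do runs fn for the first successful call with a given key and returns the
// cached result on later calls. Failures are not cached, so they stay retryable.
func (s *IdempotencyStore) Do(key string, fn func() (string, error)) (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if r, ok := s.results[key]; ok {
		return r, nil // retry: return the cached result, do not re-execute
	}
	r, err := fn()
	if err != nil {
		return "", err // not cached
	}
	s.results[key] = r
	return r, nil
}

func main() {
	store := NewIdempotencyStore()
	charges := 0
	charge := func() (string, error) {
		charges++
		return "charge-001", nil
	}
	for i := 0; i < 3; i++ {
		id, _ := store.Do("key-abc", charge)
		fmt.Println("attempt", i+1, "->", id)
	}
	fmt.Println("side effect executed", charges, "time(s)")

	_, err := store.Do("key-xyz", func() (string, error) { return "", ErrTransient })
	fmt.Println("failed op:", err)
}
```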

Pagination cursors (opaque contracts)

A cursor encodes the position opaquely (here: offset + a checksum, base64'd). The client treats it as a blob; the server can change the underlying scheme (offset → keyset) without a contract change, and a tampered cursor is rejected rather than silently returning wrong data. The lesson generalizes: expose intent, hide implementation.
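One way such a cursor could be built, as a sketch: the 8-byte offset plus CRC-32 layout and the ErrBadCursor name are assumptions for this example, not necessarily what cursor.go does. An empty cursor is treated as the first page.

```go
package main

import (
	"encoding/base64"
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// ErrBadCursor is returned for any cursor that fails decoding or its checksum.
var ErrBadCursor = errors.New("invalid cursor")

// EncodeCursor packs the offset and its CRC-32 into an opaque base64 token.
func EncodeCursor(offset uint64) string {
	buf := make([]byte, 12)
	binary.BigEndian.PutUint64(buf[:8], offset)
	binary.BigEndian.PutUint32(buf[8:], crc32.ChecksumIEEE(buf[:8]))
	return base64.RawURLEncoding.EncodeToString(buf)
}

// DecodeCursor reverses EncodeCursor and rejects garbled or tampered tokens.
func DecodeCursor(c string) (uint64, error) {
	if c == "" {
		return 0, nil // empty cursor = first page
	}
	buf, err := base64.RawURLEncoding.DecodeString(c)
	if err != nil || len(buf) != 12 {
		return 0, ErrBadCursor
	}
	if binary.BigEndian.Uint32(buf[8:]) != crc32.ChecksumIEEE(buf[:8]) {
		return 0, ErrBadCursor
	}
	return binary.BigEndian.Uint64(buf[:8]), nil
}

func main() {
	c := EncodeCursor(40)
	off, err := DecodeCursor(c)
	fmt.Println("cursor:", c, "offset:", off, "err:", err)

	// Replace the first character: the payload changes, the checksum no longer matches.
	_, err = DecodeCursor("B" + c[1:])
	fmt.Println("tampered:", err)
}
```

Only the two functions know the layout, so the payload could later carry a keyset position instead of an offset without any change visible to clients.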

Versioning strategies (when compatibility isn't enough)

in-place evolution:   add optional fields; never break — preferred.
URI/version header:   /v1, /v2 (REST) for breaking changes; run N versions.
new RPC/method:       gRPC — add GetUserV2 rather than break GetUser.
new event type/topic: events — emit OrderPlacedV2 alongside V1 during migration.

The architect picks the cheapest path that keeps consumers working, and sunsets old versions on a deprecation schedule (a migration, gw-12).

4. Core terminology

Term	Definition
API contract	The explicit, versioned interface + guarantees a service exposes.
Backward / forward compatibility	New code reads old data / old code reads new data.
Tag number	Protobuf's stable per-field id; the basis of safe evolution.
Schema registry	A store of event/message schemas + compatibility enforcement.
Idempotency key	Client-supplied key letting the server dedupe retries of a non-idempotent op.
Idempotent operation	One that can be applied multiple times with the same effect as once.
Cursor pagination	Opaque position token; decouples the API from the storage scheme.
Deprecation	The scheduled sunset of an old contract version.
Wire-breaking change	A change that corrupts serialization (tag reuse, type change).

5. Mental models

A contract is a promise you keep while changing your mind. It frees you to rewrite internals. The moment a consumer depends on something not in the contract (field order, an internal id, an offset), you've silently narrowed the promise and recoupled.
Tag numbers are seat assignments, not row order. Protobuf reads by seat number (tag), so you can reorder, rename, or add seats freely — but you must never give an old seat number to a different passenger (reuse a tag) or change who sits there (change a type). "Reserve the tag" is "keep the seat empty forever."
Idempotency keys are coat-check tickets. You hand in your coat (operation) once and get a ticket (key). Show the ticket again and you get the same coat back — not a second coat. At-least-once delivery hands you the ticket multiple times; the coat-check applies the operation once.
An opaque cursor is a valet ticket, not your car keys. You can't use it to drive (peek at the offset / forge a position); you hand it back and the valet fetches the next page. Hiding the mechanism lets the valet reorganize the lot (offset → keyset) without changing your ticket.

6. Common misconceptions

"Versioning = put /v2 in the URL." That's the last resort (a breaking change you must run in parallel and migrate off). Most evolution should be in-place via compatible changes; reserve new versions for genuinely breaking ones.
"Adding a field is always safe." Adding an optional field is; adding a required one breaks old producers, and reusing a retired tag corrupts the wire. Run a compatibility check; don't eyeball it.
"GET is idempotent so I'm fine." The problem is the writes (POST/PATCH) under retries and at-least-once delivery. Idempotency keys are for the non-idempotent operations, which is most of the interesting ones.
"Offset pagination is fine." It leaks the storage model into the contract (you can't switch to keyset later without breaking clients) and is incorrect under concurrent inserts (items shift between pages). Opaque cursors fix both.
"REST vs gRPC vs events is a religious choice." It's per-edge: gRPC for typed internal RPC (gw-02), REST for broad external reach, events for decoupling/fan-out (pa-03). One platform uses all three; the architect chooses per interface based on coupling and reach.

7. Interview talking points

"How do you evolve an API without breaking consumers?" Compatible-change rules (add optional, never reuse tags, never change types), enforced by a compatibility check in CI (show buf breaking / schema-registry compat). Breaking changes get a new version run in parallel + a deprecation schedule. Name backward vs forward compatibility.
"How do you make a non-idempotent API safe under retries?" Client-supplied idempotency keys; the server dedupes (first executes + caches, retries return the cached result), doesn't cache failures, and expires keys. Essential under gw-06 retries and pa-03/05 at-least-once delivery.
"REST vs gRPC vs events — when each?" gRPC: typed, low-latency internal RPC with strong schemas (HTTP/2, gw-02). REST: broad/external reach, cacheability, simplicity. Events: decoupling, fan-out, load-leveling, audit (pa-03). Per-edge decision driven by coupling and consumer reach.
"Design pagination." Opaque cursors (encode position + integrity), not offsets — so you can change the storage scheme without breaking the contract and stay correct under concurrent writes. Bound page size; make the cursor tamper-evident.
"How do you govern contracts across 50 teams?" A schema registry with enforced compatibility, contract checks in CI (a fitness function, pa-10), consumer-driven contract tests, and a deprecation policy. The architect provides the paved road so teams evolve safely without a central bottleneck.

8. Connections to other labs

pa-01 — contracts are what keep the boundaries you drew decoupled over time.
pa-03 / pa-04 — event/message schemas are contracts too, governed by a registry with the same compatibility rules.
pa-05 — idempotency keys make at-least-once delivery and retried/replayed messages safe (effectively-once).
pa-10 — the compatibility check becomes a fitness function in CI; consumer-driven contract testing is a testing strategy.
gw-02 — gRPC/HTTP-2 wire format; gRPC status in trailers.
gw-06 — retries are why idempotency matters; gw-12 runs version deprecations as migrations.

pa-02 — The Hitchhiker's Guide to API & Contract Design

Companion to CONCEPTS.md, with the runnable contract toolkit in src/go/apicontract/. The architect's job: design interfaces that evolve without breaking consumers you'll never meet.

bash scripts/verify.sh runs the demo:

contract compatibility check (old -> new):
  [BREAKING] tag 2: field "email" became required (old producers may omit it)
  [SAFE    ] tag 3: optional field "nickname" added
  => CI gate: breaking=true
idempotent retries (same key runs once): side effect executed 1 time(s)
opaque pagination cursor: tampered cursor rejected

Three small pieces, each a contract-design lesson.

1. Compatibility as a CI gate (compat.go)

Protobuf identifies fields by a stable tag number, which is exactly what lets schemas evolve. Check(old, new) encodes the rules the ecosystem enforces (the same ones buf breaking checks), and HasBreaking is the gate you wire into CI. The tests pin each rule: adding an optional field is SAFE, adding a required one is BREAKING, changing a field's type on a tag is BREAKING (the wire misreads it), a rename is a WARNING (wire-OK, source change), and tightening optional→required is BREAKING.

The architect's insight: these dangerous changes don't fail to compile — they fail in production, far from the change, as corrupted data or broken consumers. So you make them fail in CI instead, as a fitness function (pa-10). "Reserve the tag" on removal is the one counter-intuitive rule: a removed tag must never be reused, or old data on that tag is silently misinterpreted.

Java/Python note: the rules are identical; the tooling differs (buf, Confluent Schema Registry with Avro/Protobuf compat modes). The policy — compatible-by-default, breaking-changes-get-a-new-version — is the architecture decision; the checker enforces it.

2. Idempotency = safe retries (idempotency.go)

Networks retry, queues deliver at-least-once (pa-03/05), gateways retry (gw-06). Without protection, "create order" and "charge card" double-apply. IdempotencyStore.Do(key, fn) executes fn once per key and returns the cached result on retries (TestIdempotencyReplay: three attempts, one side effect). Two design details that matter:

Failures aren't cached (TestIdempotencyDoesNotCacheFailures) — a genuinely failed operation must remain retryable.
Keys expire (TTL) — you can't remember every key forever; the window must exceed the client's retry horizon.

Designing idempotency into the contract (the client sends an Idempotency-Key header / field) is what makes at-least-once delivery (pa-05) safe — "effectively-once" = at-least-once + idempotent consumer.

3. Opaque cursors = freedom to change (cursor.go)

EncodeCursor/DecodeCursor produce a base64, checksum-protected position token. The client treats it as opaque; TestCursorTamperRejected shows a mangled cursor is rejected (not silently mis-served), and TestCursorRoundTrip confirms clean round-trips. Why this is API design, not a utility: an offset in the contract freezes your storage model — you can never switch to keyset pagination without breaking clients, and offsets are wrong under concurrent inserts (items shift between pages). An opaque cursor hides the mechanism, so you keep that freedom. The general principle — expose intent, hide implementation — is the heart of durable API design.

4. The architect's decision: which style per edge

There's no single right answer; it's per interface:

gRPC   — typed internal RPC, low latency, strong schemas (gw-02).
REST   — broad/external reach, cacheable, simple.
events — decouple in time, fan-out, load-level, audit (pa-03).

A real platform uses all three. The architect's contribution is choosing per edge (by coupling + reach), and governing evolution centrally (schema registry + compat checks + deprecation policy) so 50 teams ship independently without a review bottleneck.

5. Hands-on

cd src/go
bash ../scripts/verify.sh
go run ./cmd/apidemo

6. Exercises

Wire the gate into CI: make HasBreaking fail a test when a committed schema breaks the previous one (a fitness function, pa-10).
Reserved tags: extend Check to track reserved (removed) tags and flag a new field that reuses one — the wire-corruption case a two-schema diff can't see without history.
Consumer-driven contracts: model a consumer's expected fields and verify a producer change doesn't break that specific consumer.
Keyset pagination: change the cursor to encode a (last_id, last_sort_key) keyset; show the contract (opaque cursor) is unchanged — the payoff of opacity.
gRPC trailers: model gRPC status-in-trailers (gw-02) and show why an HTTP-200-but-grpc-error must be counted as a failure (gw-11).

pa-02 — References

Contracts & compatibility

Protocol Buffers docs — field numbers, reserved fields, proto3 semantics, the rules behind safe evolution. https://protobuf.dev/programming-guides/proto3/
buf — buf breaking (the compatibility-rule engine modeled here). https://buf.build/docs/breaking/overview
Confluent Schema Registry — Avro/Protobuf/JSON compatibility modes (BACKWARD / FORWARD / FULL) for event schemas (pa-03/04). https://docs.confluent.io/platform/current/schema-registry/
Google API Improvement Proposals (AIP) — resource-oriented REST, pagination, idempotency, versioning. https://google.aip.dev/
Designing Data-Intensive Applications (Kleppmann) — ch. 4, encoding and schema evolution.

Idempotency & pagination

Stripe — Idempotent requests (the canonical idempotency-key design). https://docs.stripe.com/api/idempotent_requests
"Pagination: offset vs cursor/keyset" writeups (Slack, Shopify engineering blogs) — why opaque cursors beat offsets.

API styles

gRPC docs (pairs with gw-02 for the HTTP/2 wire). REST: Richardson maturity model, RFC 9110 semantics. Event APIs: AsyncAPI spec.
Consumer-Driven Contracts (Pact) — testing contracts from the consumer's side (a testing strategy, pa-10).

Cross-lab links

pa-01 (boundaries), pa-03/04 (event schemas as contracts), pa-05 (idempotent consumers), pa-10 (compat check as a fitness function), gw-02 (gRPC wire), gw-06/gw-12 (retries / version deprecation as migration).

pa-02 — Analysis

What the toolkit must get right

Sound compatibility rules. Type change / add-required / tighten-to-required are breaking; add-optional / loosen are safe; remove / rename are warnings. HasBreaking is the CI gate.
Idempotency that's actually safe. Run once per key; never cache failures; expire keys past the client's retry horizon.
Tamper-evident opaque cursors. A mangled cursor errors; the client cannot depend on the internal position scheme.

Design decisions

Tag-keyed diff. Fields are matched by tag number (protobuf semantics), so rename/reorder are distinguishable from add/remove/type-change. This is what makes the verdicts correct.
Severity, not boolean. Breaking / Warning / Safe lets CI block only the truly dangerous changes while still surfacing "reserve the tag" and rename hygiene.
Checksum, not crypto, on cursors. CRC catches accidental/garbled cursors cheaply; for adversarial tampering you'd use an HMAC with a server secret (noted as the production upgrade).
Don't cache idempotency failures. A cached failure would make a transient error permanent for that key — the opposite of the goal.
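The HMAC upgrade mentioned for cursors might look like the sketch below. The function names, the 8-byte payload, and the choice of HMAC-SHA-256 are assumptions for illustration, not part of the lab's code.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/binary"
	"errors"
	"fmt"
)

// ErrBadCursor is returned for any cursor that fails decoding or verification.
var ErrBadCursor = errors.New("invalid cursor")

// sign computes HMAC-SHA-256 over the payload with the server's secret.
func sign(secret, payload []byte) []byte {
	m := hmac.New(sha256.New, secret)
	m.Write(payload)
	return m.Sum(nil)
}

// EncodeSignedCursor emits base64(offset || HMAC(offset)).
func EncodeSignedCursor(secret []byte, offset uint64) string {
	payload := make([]byte, 8)
	binary.BigEndian.PutUint64(payload, offset)
	return base64.RawURLEncoding.EncodeToString(append(payload, sign(secret, payload)...))
}

// DecodeSignedCursor verifies the MAC in constant time before trusting the offset.
func DecodeSignedCursor(secret []byte, c string) (uint64, error) {
	buf, err := base64.RawURLEncoding.DecodeString(c)
	if err != nil || len(buf) != 8+sha256.Size {
		return 0, ErrBadCursor
	}
	payload, mac := buf[:8], buf[8:]
	if !hmac.Equal(mac, sign(secret, payload)) {
		return 0, ErrBadCursor
	}
	return binary.BigEndian.Uint64(payload), nil
}

func main() {
	secret := []byte("server-side-secret") // placeholder; load from a secret store in practice
	c := EncodeSignedCursor(secret, 40)
	off, err := DecodeSignedCursor(secret, c)
	fmt.Println("offset:", off, "err:", err)

	forged := EncodeSignedCursor([]byte("attacker-guess"), 9000)
	_, err = DecodeSignedCursor(secret, forged)
	fmt.Println("forged:", err)
}
```

Unlike a CRC, which anyone can recompute, a valid MAC requires the server's secret, so a client cannot mint a cursor for a position of its choosing.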

Tradeoffs worth flagging

Two-schema diff can't see history. Reusing a previously-removed tag is the most dangerous change and needs a reserved-tag registry to detect (exercise §6.2) — a schema registry does this.
Compat rules are policy. "Backward only" vs "full" compatibility is an architecture decision with real cost (full compat constrains both producers and consumers). The checker enforces whatever policy you set.
Idempotency storage is a real system. A production key store is a TTL'd, replicated KV on the write path; its availability bounds the API's. (db-20 / a Redis-class store.)
Opaque cursors trade debuggability for freedom. You can't eyeball a cursor in a URL; that's the point, but add server-side decoding tools.

What production adds beyond this lab

A schema registry with enforced compat modes + a deprecation workflow.
HMAC-signed cursors; keyset (not offset) pagination underneath.
A replicated idempotency-key store with exactly-once-ish semantics.
Consumer-driven contract tests + the compat gate wired into CI (pa-10).

pa-02 — Execution

Prerequisites

Go ≥ 1.25 (stdlib only, offline).

One-shot

cd pa-02-api-design && bash scripts/verify.sh

Per-language workflow (Go)

cd pa-02-api-design/src/go
go test -race -count=1 ./...      # compat rules, idempotency, cursors
go run ./cmd/apidemo

Package map

File	What
`apicontract/compat.go`	protobuf-style schema compatibility check + `HasBreaking` CI gate
`apicontract/idempotency.go`	idempotency-key store (run-once, no-cache-on-failure, TTL)
`apicontract/cursor.go`	opaque, tamper-evident pagination cursors
`cmd/apidemo`	a compat check, idempotent retries, and a cursor round-trip + tamper

See GUIDE.md for the deep dive.

pa-02 — Verification

One command

cd pa-02-api-design && bash scripts/verify.sh

What the tests prove

Test	Invariant
`TestCompatAddOptionalIsSafe`	adding an optional field is compatible
`TestCompatAddRequiredIsBreaking`	adding a required field breaks old producers
`TestCompatRemoveFieldIsWarning`	removal warns (reserve the tag)
`TestCompatTypeChangeIsBreaking`	changing a tag's type is wire-breaking
`TestCompatOptionalToRequiredIsBreaking`	tightening to required is breaking
`TestCompatRenameIsWarningNotBreaking`	a rename (same tag/type) is wire-OK
`TestIdempotencyReplay`	same key runs the side effect once; retries return cached
`TestIdempotencyDoesNotCacheFailures`	failures stay retryable
`TestCursorRoundTrip`	cursors encode/decode losslessly; empty = offset 0
`TestCursorTamperRejected`	tampered/garbage cursors are rejected

All under -race.

What "green" does NOT guarantee

No reserved-tag history. Tag reuse (the worst change) needs a registry to detect (exercise §6.2); a two-schema diff can't.
CRC, not HMAC. Cursors resist accidents, not a determined attacker (use HMAC in production).
In-memory idempotency. A real key store is a replicated TTL'd KV on the write path.

pa-03 — Event-Driven Architecture & Async Patterns

This is the headline of the Apple JD: "experience designing event-driven architectures and asynchronous communication patterns." Synchronous request/reply (pa-02) couples services in time — the caller is only as available as the callee, and latency compounds down the chain. Events break that coupling: a producer emits a fact and moves on; consumers react when they're ready. The price is eventual consistency — ordering, duplicate delivery, and "where did my event go" become your problems.

This lab builds an event bus that makes those problems concrete: fan-out pub/sub, at-least-once delivery with retries, idempotent consumers (the dedup that makes at-least-once safe), and a dead-letter queue for poison messages.

1. What is it?

Event-driven architecture is a style where services communicate by producing and consuming events (immutable facts: "OrderPlaced") over a broker, rather than calling each other directly. Three sub-styles (increasing power and complexity):

Event notification — "something happened, go look": a thin event; the consumer calls back for details. Low coupling, extra round-trips.
Event-carried state transfer — the event carries the data the consumer needs; no callback. Decoupled, but data is duplicated and can be stale.
Event sourcing — the event log is the source of truth; state is a fold over events (CQRS often pairs with it). Powerful audit/replay, big complexity jump.

And two coordination styles for multi-step workflows:

Choreography — services react to each other's events; no central brain. Loose coupling, but the workflow is emergent and hard to see.
Orchestration — a coordinator drives the steps (a saga, pa-05). Visible and controllable, at the cost of a central component.

2. Why does it matter?

It's the antidote to the distributed monolith (pa-01). Synchronous call chains make services co-available and co-failing; converting an edge to an event breaks the temporal coupling. An architect's most common "fix this design" move is "make this async."
It's how platforms absorb load and spikes. A queue/log levels load (producers and consumers run at their own pace) and provides a buffer during downstream outages — the consumer catches up later instead of the producer failing now.
The hard parts are where architects earn their keep. Delivery semantics (at-least-once vs exactly-once), ordering, idempotency, duplicates, poison messages, and the dual-write problem (pa-05) are subtle and dangerous. Knowing the patterns cold — and that exactly-once is a myth you approximate with at-least-once + idempotency — is the differentiator.
It enables independent evolution and fan-out. New consumers subscribe to an existing event stream without the producer knowing or changing — the open/closed principle at the platform level.

3. How does it work?

Pub/sub and fan-out

A producer publishes to a topic; the broker fans out to every subscriber. The producer doesn't know its consumers — that's the decoupling. Bus.Publish delivers to all subscriptions on a topic; adding a consumer is a Subscribe, no producer change.
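A stripped-down sketch of fan-out, assuming a synchronous in-process bus: the Event fields, the Handler signature, and the return value of Publish are illustrative choices, and retries, dedup, and locking are left out on purpose.

```go
package main

import "fmt"

// Event is an immutable fact published on a topic.
type Event struct {
	ID      string
	Topic   string
	Payload string
}

// Handler is a subscriber callback.
type Handler func(Event)

// Bus fans each published event out to every subscriber of its topic.
// This sketch is not safe for concurrent use.
type Bus struct {
	subs map[string][]Handler
}

func NewBus() *Bus {
	return &Bus{subs: map[string][]Handler{}}
}

// Subscribe registers a handler; the producer never learns about it.
func (b *Bus) Subscribe(topic string, h Handler) {
	b.subs[topic] = append(b.subs[topic], h)
}

// Publish delivers the event to all subscribers of its topic and returns
// how many handlers received it.
func (b *Bus) Publish(e Event) int {
	for _, h := range b.subs[e.Topic] {
		h(e)
	}
	return len(b.subs[e.Topic])
}

func main() {
	bus := NewBus()
	bus.Subscribe("orders", func(e Event) { fmt.Println("email service saw", e.ID) })
	bus.Subscribe("orders", func(e Event) { fmt.Println("analytics saw", e.ID) })

	n := bus.Publish(Event{ID: "evt-1", Topic: "orders", Payload: "OrderPlaced"})
	fmt.Println("delivered to", n, "subscribers")
}
```

Adding a third consumer is one more Subscribe call; the publishing code does not change.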

Delivery semantics (pick your poison)

at-most-once:    deliver, don't retry. Simple; loses messages on failure. Rarely OK.
at-least-once:   deliver, retry until acked. No loss; DUPLICATES possible. The default.
exactly-once:    a MYTH end-to-end. Approximated by at-least-once + IDEMPOTENT consumers.

The bus delivers at-least-once with bounded retries. A handler that errors is retried up to maxRetries; if it keeps failing, the event is dead-lettered rather than blocking the stream forever.

Idempotent consumers (what makes at-least-once safe)

Because retries and broker redelivery cause duplicates, consumers must be idempotent: processing the same event id twice has the same effect as once. The bus dedupes by Event.ID (the consumer's seen set). Publish the same id three times → the handler runs once. This is the practical "effectively-once" the industry actually ships (pa-02 idempotency keys, gw-05 push dedup).
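The dedup described above can be sketched as a consumer wrapper with a seen set. DedupConsumer and its methods are hypothetical names, and the seen set here is in-memory and unbounded; a real one would be persisted and expired.

```go
package main

import (
	"fmt"
	"sync"
)

// Event carries a stable ID that identifies the logical event across redeliveries.
type Event struct {
	ID      string
	Payload string
}

// DedupConsumer runs its handler at most once per successfully processed event ID.
type DedupConsumer struct {
	mu     sync.Mutex
	seen   map[string]bool
	handle func(Event) error
}

func NewDedupConsumer(handle func(Event) error) *DedupConsumer {
	return &DedupConsumer{seen: map[string]bool{}, handle: handle}
}

// Consume acknowledges duplicates without re-running the handler. An event is
// marked seen only after the handler succeeds, so a failed attempt can be retried.
func (c *DedupConsumer) Consume(e Event) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.seen[e.ID] {
		return nil // duplicate delivery: ack and skip
	}
	if err := c.handle(e); err != nil {
		return err
	}
	c.seen[e.ID] = true
	return nil
}

func main() {
	runs := 0
	c := NewDedupConsumer(func(e Event) error {
		runs++
		fmt.Println("processing", e.ID)
		return nil
	})
	for i := 0; i < 3; i++ {
		_ = c.Consume(Event{ID: "evt-42", Payload: "OrderPlaced"})
	}
	fmt.Println("handler ran", runs, "time(s)")
}
```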

Dead-letter queue (don't let one poison message stop the line)

A message that can never succeed (malformed, a permanent bug) would, without a DLQ, retry forever and block its partition. The DLQ moves it aside after N attempts so the stream keeps flowing; operators inspect and replay DLQ'd events after fixing the consumer.
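Bounded retries followed by dead-lettering can be sketched like this. Consumer, MaxAttempts (total attempts, an assumption about how the bound is counted), and ErrHandlerFailed are illustrative names; backoff between attempts is left out to keep the sketch short.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrHandlerFailed is a hypothetical handler error used by the demo.
var ErrHandlerFailed = errors.New("handler failed")

type Event struct {
	ID      string
	Payload string
}

// Consumer delivers events with a bounded number of attempts and moves
// events that never succeed to a dead-letter queue.
type Consumer struct {
	MaxAttempts int
	Handle      func(Event) error
	DLQ         []Event
}

// Deliver returns true if the handler eventually succeeded. After MaxAttempts
// failures the event is dead-lettered so the stream can keep flowing.
func (c *Consumer) Deliver(e Event) bool {
	for attempt := 1; attempt <= c.MaxAttempts; attempt++ {
		if err := c.Handle(e); err == nil {
			return true
		}
	}
	c.DLQ = append(c.DLQ, e)
	return false
}

func main() {
	c := &Consumer{
		MaxAttempts: 3,
		Handle: func(e Event) error {
			if e.Payload == "malformed" {
				return ErrHandlerFailed // a poison message never succeeds
			}
			return nil
		},
	}
	for _, e := range []Event{{"e1", "ok"}, {"e2", "malformed"}, {"e3", "ok"}} {
		fmt.Println(e.ID, "delivered:", c.Deliver(e))
	}
	fmt.Println("dead-lettered:", len(c.DLQ))
}
```

In the demo the poison event e2 lands in the DLQ and e3 is still processed, which is the whole point of moving it aside.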

Backpressure

When consumers fall behind, something must give: a bounded queue either blocks the producer (backpressure — the gw-01/gw-02 idea) or drops (load-shedding, gw-06). An unbounded queue is a latent OOM. The architect chooses the bound and the overflow policy per stream.
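The two overflow policies map naturally onto a buffered Go channel. The Queue type and its method names are assumptions for this sketch: one publish path blocks until there is room (or the caller cancels), the other drops immediately when the queue is full.

```go
package main

import "fmt"

// Queue is a bounded buffer between a producer and its consumers.
type Queue struct {
	ch chan string
}

func NewQueue(bound int) *Queue {
	return &Queue{ch: make(chan string, bound)}
}

// PublishBlocking applies backpressure: it waits for free space and gives up
// only when cancel is closed. It returns true if the message was enqueued.
func (q *Queue) PublishBlocking(msg string, cancel <-chan struct{}) bool {
	select {
	case q.ch <- msg:
		return true
	case <-cancel:
		return false
	}
}

// TryPublish sheds load: it never waits and returns false when the queue is full.
func (q *Queue) TryPublish(msg string) bool {
	select {
	case q.ch <- msg:
		return true
	default:
		return false
	}
}

func main() {
	q := NewQueue(2)
	for _, m := range []string{"a", "b", "c"} {
		fmt.Println("try", m, "->", q.TryPublish(m))
	}
	fmt.Println("consumer took", <-q.ch)
	fmt.Println("try c again ->", q.TryPublish("c"))
}
```

The bound passed to NewQueue and the choice between the two publish paths are exactly the per-stream decisions the paragraph above describes.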

4. Core terminology

Term	Definition
Event	An immutable fact that something happened ("OrderPlaced").
Topic / subject	A named stream consumers subscribe to.
Pub/sub / fan-out	One publish delivered to all subscribers.
At-least-once	Delivery that never loses but may duplicate; the practical default.
Idempotent consumer	Processing a duplicate has the same effect as processing once.
Dead-letter queue (DLQ)	Where un-processable messages go after exhausting retries.
Choreography / orchestration	Emergent (event-reaction) vs coordinated (a saga) workflows.
Event notification / -carried state / sourcing	Increasing amounts of data/authority in the event.
Backpressure	Slowing the producer when consumers can't keep up.
Replay	Re-processing events (from a DLQ or the log, pa-04).

5. Mental models

Sync is a phone call; async is a mailbox. A call needs both parties present now (temporal coupling); a letter is dropped off and read later (decoupled). A call fails when nobody picks up, while a mailbox keeps working when the recipient is out. Choose per edge; you move the complexity, you don't delete it.
At-least-once + idempotency = effectively-once. Stop chasing exactly-once delivery (a distributed-systems unicorn). Deliver redundantly and make the effect idempotent. The dedup set is the whole trick.
The DLQ is the ER triage room. You don't let one critically-broken patient (poison message) block the whole queue; you move them aside, keep the line moving, and treat them separately. A stream with no DLQ is a stream one bad message away from a stall.
A queue is a shock absorber, not a warehouse. It smooths spikes and buffers brief outages. If it's growing unboundedly, your consumers are permanently too slow — add capacity or shed; don't add disk.

6. Common misconceptions

"Exactly-once delivery exists." Not end-to-end. Brokers offer at-least-once (or "exactly-once" within their own boundary via transactions, but the moment your consumer has side effects, you need idempotency). Design for at-least-once + idempotent consumers.
"Async is always better / decoupling is free." Async adds eventual consistency, ordering concerns, and operational opacity (where's my event?). For an immediate answer the caller can't proceed without, sync is correct. The architect puts each edge on the right side.
"Events guarantee order." Only within a partition (pa-04). Across partitions/topics there's no global order. Design consumers to tolerate reordering, or partition by the key whose order matters.
"Just retry forever." Forever-retrying a poison message blocks the stream and can become a retry storm (gw-06). Bound retries + DLQ + backoff/jitter.
"Choreography is simpler, always use it." Pure choreography makes a multi-step workflow invisible and hard to change/debug — coupling hidden in event chains. For complex workflows with compensation, an orchestrated saga (pa-05) is clearer. Trade-off, not dogma.

7. Interview talking points

"When do you use events vs synchronous calls?" Sync when the caller needs an immediate answer to proceed (own the latency/availability cost: timeouts, breakers, bulkheads — gw-06/pa-09). Async to decouple in time, fan-out to many consumers, level load, or build an audit trail — accepting eventual consistency. Decide per edge; name the cost you're taking on.
"How do you handle duplicate deliveries?" Idempotent consumers: dedup by event id (or idempotency key, pa-02). State plainly that exactly-once is at-least-once + idempotency; don't claim the broker gives you exactly-once for free.
"How do you handle a message that can't be processed?" Bounded retries with backoff+jitter, then a DLQ so it doesn't block the stream; alert on DLQ depth; replay after fixing the consumer. Distinguish transient (retry) from permanent (DLQ immediately) failures.
"Choreography or orchestration?" Choreography for simple, loosely coupled reactions; orchestration (a saga, pa-05) when a multi-step workflow needs visibility, ordering, and compensation. The trade-off is decoupling vs observability/control.
"How do you prevent a slow consumer from taking everything down?" Bounded queues (backpressure or shed), consumer-group scaling (pa-04), per-consumer isolation (bulkheads, pa-09), and DLQ for poison. An unbounded buffer is a deferred OOM.
"Notification vs event-carried state vs event sourcing?" Increasing data/authority in the event: notification (thin, callback for details), carried-state (self-contained, data duplicated), sourcing (the log is truth, replayable, big complexity). Pick the least powerful that meets the need.

8. Connections to other labs

pa-01 — events break the temporal coupling that creates distributed monoliths.
pa-02 — event schemas are contracts (schema registry + compat); idempotency keys make at-least-once safe.
pa-04 — a partitioned log is the durable, replayable, ordered backbone this bus abstracts; ordering is per-partition.
pa-05 — the outbox solves the dual-write problem (DB + publish atomically); sagas are orchestrated event workflows with compensation.
gw-05 — push delivery is at-least-once + idempotent dedup at the edge; gw-06 — retries/backoff/jitter and load shedding.

pa-03 — The Hitchhiker's Guide to Event-Driven Architecture

Companion to CONCEPTS.md, with the runnable event bus in src/go/eventbus/. This is the JD headline: designing async, event-driven systems — and owning their failure modes.

bash scripts/verify.sh runs the demo: an OrderPlaced event fans out to audit/fulfillment/fraud consumers; a flaky consumer retries then succeeds; a re-published event is deduped; a poison message lands in the DLQ:

stats: delivered=8 retried=3 deduped=3 deadLettered=1
  DLQ: order-poison on "fraud" after 2 attempts: cannot score: malformed

That single line is the whole discipline: deliver redundantly, dedup idempotently, and quarantine what can't be processed.

1. Fan-out pub/sub (bus.go)

Publish(event) delivers to every subscriber on the topic; the producer doesn't know its consumers (TestFanOut, TestTopicIsolation). That's the decoupling that makes events powerful: a new consumer Subscribes to an existing stream with zero producer change — the open/closed principle at the platform level. It's also the structural fix for the distributed monolith (pa-01): converting a synchronous A→B call into "A emits, B consumes" removes the temporal coupling that made A only as available as B.
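A minimal sketch of synchronous fan-out, assuming a map from topic to handlers. The Bus, Event, and Handler shapes here are illustrative guesses; the lab's bus.go may be structured differently.

```go
package main

import "fmt"

type Event struct {
	ID    string
	Topic string
}

type Handler func(Event) error

// Bus is a hypothetical synchronous fan-out bus.
type Bus struct {
	subs map[string][]Handler
}

func NewBus() *Bus { return &Bus{subs: make(map[string][]Handler)} }

// Subscribe registers h on a topic; the producer never learns about it.
func (b *Bus) Subscribe(topic string, h Handler) {
	b.subs[topic] = append(b.subs[topic], h)
}

// Publish delivers ev to every subscriber of ev.Topic and returns how many
// handlers ran. Errors are ignored here; retries are a separate concern.
func (b *Bus) Publish(ev Event) int {
	n := 0
	for _, h := range b.subs[ev.Topic] {
		_ = h(ev)
		n++
	}
	return n
}

func main() {
	bus := NewBus()
	for _, name := range []string{"audit", "fulfillment", "fraud"} {
		name := name
		bus.Subscribe("orders", func(ev Event) error {
			fmt.Println(name, "got", ev.ID)
			return nil
		})
	}
	fmt.Println("handlers run:", bus.Publish(Event{ID: "order-1", Topic: "orders"}))
}
```

Adding a fourth consumer is one more Subscribe call; Publish and its caller stay untouched.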

2. At-least-once + retries (bus.go)

Real brokers deliver at-least-once (never lose, may duplicate). The bus models the retry side: a handler that returns an error is retried up to maxRetries. TestRetryThenSucceed shows a consumer failing twice then succeeding — Retried=2, Delivered=1, nothing dead-lettered. The architect's framing: distinguish transient failures (retry with backoff+jitter, gw-06) from permanent ones (don't waste retries — DLQ immediately).

3. Idempotent consumers — the heart of it

Because at-least-once means duplicates, consumers must be idempotent. The bus dedups by Event.ID (each subscription's seen set): TestIdempotentDedup publishes the same id three times and the handler runs once (Deduped=2). This is the single most important event-driven lesson:

Exactly-once delivery is a myth. Exactly-once effect is at-least-once delivery + an idempotent consumer. Stop chasing the former; build the latter.

It connects straight to pa-02 (idempotency keys) and gw-05 (push dedup) — the same trick at three layers.

4. The dead-letter queue (bus.go)

A poison message (malformed, a permanent bug) would retry forever and block the stream. Once its retries are exhausted (maxRetries+1 attempts in total), the bus moves it to the DLQ (TestPoisonGoesToDLQ: order-poison dead-lettered after 3 attempts, never delivered, the stream keeps flowing). In production you alert on DLQ depth and replay after fixing the consumer. A stream without a DLQ is one bad message away from a stall, and naive "retry forever" is how a blip becomes a retry storm (gw-06).

5. The trade-offs an architect owns

The code is small; the decisions are the job:

Sync vs async per edge — immediate-answer vs decoupling. You move complexity (eventual consistency, ordering, dedup), you don't delete it.
Delivery semantics — at-least-once + idempotency is the default; at-most-once only where loss is acceptable.
Ordering — only within a partition (pa-04). Partition by the key whose order matters; otherwise design consumers to tolerate reordering.
Backpressure vs shed — bounded queues block the producer or drop; unbounded queues are a deferred OOM. Choose per stream.
Choreography vs orchestration — emergent vs coordinated (a saga, pa-05). Decoupling vs visibility.

This synchronous bus is a teaching model; a production system puts the events on a durable, partitioned, replayable log — which is exactly pa-04.

6. Hands-on

cd src/go
bash ../scripts/verify.sh
go run ./cmd/ebsim

7. Exercises

Async + backpressure: give each subscription a bounded channel and a worker; make Publish block (backpressure) or drop (shed) when full, and measure the difference under a slow consumer.
Backoff + jitter: add exponential backoff with jitter between retries (reuse gw-06's idea) and show it avoids synchronized retry storms.
Replay from DLQ: add Replay() that re-publishes DLQ'd events after you "fix" the handler; confirm idempotency prevents double-processing of anything already delivered.
Consumer groups: extend so multiple instances of one logical consumer share the load (each event to one instance) — the bridge to pa-04.
Choreographed saga: wire three consumers so OrderPlaced → reserve-inventory → charge-payment by reacting to each other's events; then feel the pain of no central view, and compare to pa-05's orchestrated saga.

pa-03 — References

Event-driven patterns

Martin Fowler — What do you mean by "Event-Driven"? (notification vs event-carried state vs event sourcing), EventCollaboration. https://martinfowler.com/articles/201701-event-driven.html
Chris Richardson, microservices.io — Saga, Event Sourcing, CQRS, the patterns catalog. https://microservices.io/patterns/
Designing Data-Intensive Applications (Kleppmann) — ch. 11, stream processing; delivery semantics; the dual-write problem.
Enterprise Integration Patterns (Hohpe/Woolf) — pub/sub, DLQ, message routing (the vocabulary).

Delivery semantics & idempotency

Kafka docs — delivery guarantees, idempotent producer, transactions.
"You Cannot Have Exactly-Once Delivery" (Tyler Treat) — why at-least-once + idempotency is the real answer.
AsyncAPI — describing event-driven APIs (the contract side, pa-02).

Brokers to contrast (pa-04 builds the log model)

Kafka / Pulsar (partitioned log), NATS/JetStream (subjects + streams), RabbitMQ (queues/exchanges). Know when each fits.

Cross-lab links

pa-02 (event schemas + idempotency keys), pa-04 (the durable partitioned log under this bus), pa-05 (outbox + saga), gw-05 (push at-least-once + dedup), gw-06 (retries/backoff/shed).

pa-03 — Analysis

What the bus must get right

Fan-out to all subscribers; topic isolation.
At-least-once with bounded retries, then DLQ (never block the stream on a poison message).
Idempotent consumers: a duplicate id is processed once.
Honest counters (delivered/retried/deduped/dead-lettered) — the signals you'd alert on (gw-11).

Design decisions

Synchronous delivery for determinism. Tests are reproducible and the patterns are legible. Production is async over a durable log (pa-04); the GUIDE's exercises add bounded async + backpressure.
Dedup per subscription, keyed by event id. Each consumer owns its idempotency; the bus doesn't assume a global dedup. This matches reality (consumers dedup independently).
Don't cache failures into seen. Only a successful handle marks the id seen, so a transient failure is retried, not silently skipped.
DLQ after maxRetries, with attempt count + reason. Operators need why and how-many to triage and replay.

Tradeoffs worth flagging

At-least-once vs exactly-once. We choose at-least-once + idempotent consumers — the only honest option end-to-end. Brokers' "exactly-once" stops at the broker boundary; your side effects still need idempotency.
Ordering. This bus preserves per-publish order to each subscriber but offers no cross-event ordering guarantee; real ordering is per-partition (pa-04). Don't design consumers that assume global order.
Backpressure vs shed. Synchronous delivery = implicit backpressure (the publisher waits). An async bus must choose bound + overflow policy; unbounded = deferred OOM.
DLQ is not a graveyard. Un-replayed DLQs hide real failures; alert on depth and have a replay path.

What production adds beyond this lab

A durable, partitioned, replayable log (pa-04) with consumer groups.
Async delivery with bounded queues, backoff+jitter, and consumer scaling.
A schema registry for event contracts (pa-02) and DLQ replay tooling.
End-to-end tracing across the async hop (gw-11) — the hardest observability problem in event systems.

pa-03 — Execution

Prerequisites

Go ≥ 1.25 (stdlib only, offline).

One-shot

cd pa-03-event-driven-architecture && bash scripts/verify.sh

Per-language workflow (Go)

cd pa-03-event-driven-architecture/src/go
go test -race -count=1 ./...      # fan-out, retry, dedup, DLQ
go run ./cmd/ebsim

Package map

File	What
`eventbus/bus.go`	topic pub/sub, at-least-once retries, idempotent dedup, DLQ, counters
`cmd/ebsim`	fan-out + retry-then-succeed + dedup + poison→DLQ demo

See GUIDE.md for the deep dive and the sync-vs-async trade-offs.

pa-03 — Verification

One command

cd pa-03-event-driven-architecture && bash scripts/verify.sh

What the tests prove

Test	Invariant
`TestFanOut`	one publish reaches all subscribers
`TestTopicIsolation`	a subscriber never sees other topics' events
`TestRetryThenSucceed`	a transient failure recovers via retries; not dead-lettered
`TestPoisonGoesToDLQ`	a permanent failure is dead-lettered after maxRetries+1 attempts; never delivered
`TestIdempotentDedup`	duplicate ids are processed once (effectively-once)
`TestDistinctEventsBothRun`	distinct ids each run

All under -race.

What "green" does NOT guarantee

Synchronous, in-memory bus. Production is a durable partitioned log (pa-04) with async consumer groups; backpressure/ordering are exercises.
No cross-event ordering. Ordering is per-partition (pa-04), not global.
No tracing across the async hop — the hard observability problem (gw-11), out of scope here.

pa-04 — A Partitioned Log (Kafka in Miniature)

The Apple JD lists "message queues and streaming platforms, such as Kafka, RabbitMQ, NATS, or Pulsar." Under almost all modern streaming sits one data structure: the partitioned, append-only commit log. Master it and Kafka, Pulsar, and NATS JetStream stop being magic — they become "a log, split into partitions, with consumer groups tracking offsets." pa-03 gave you the patterns (pub/sub, at-least-once, DLQ); this lab gives you the substrate they run on, built and tested.

You build the log: partitions (the unit of ordering and parallelism), monotonic offsets, key-based partitioning, consumer groups with committed offsets, partition assignment (range / round-robin), and retention.

1. What is it?

A commit log is an append-only, ordered sequence of records, each at a monotonically increasing offset. You only append (never update in place) and read sequentially from an offset. That's it — and it's enough to build messaging, event streaming, replication, and event sourcing.

To scale beyond one machine and one consumer, the log is split into partitions:

topic "orders", 3 partitions:
  P0:  [0:acct-1][1:acct-1][2:acct-1]            (append-only, ordered)
  P1:  [0:acct-2][1:acct-2]
  P2:  [0:acct-3]
        ▲ offset (per-partition, monotonic, stable)

A producer appends a record; its key is hashed to choose a partition, so all records for a key land in one partition and are therefore ordered relative to each other.
A consumer group reads partitions and commits offsets (where it is); on restart it resumes from the commit. Different groups read the same log independently at their own pace.
Partition assignment spreads a topic's partitions across the consumers in a group — the unit of parallelism (≤ one consumer per partition per group).
Retention drops old records (by time or size); offsets stay absolute and stable.

This is db-03's write-ahead log idea (you built one earlier) promoted to a distributed messaging primitive.

2. Why does it matter?

It's the backbone of event-driven platforms. pa-03's bus is an abstraction; a real system needs durability, ordering, replay, and parallel consumption — which is exactly what the partitioned log provides. An architect designing an eventing platform is choosing and configuring this.
Partitions are the whole scaling and ordering story. Throughput scales with partitions; ordering is guaranteed only within a partition. Choosing the partition key is therefore one of the most consequential design decisions: it decides what's ordered, what's parallelizable, and whether you get hot partitions (skew).
The log unifies messaging and state. Because it's durable and replayable, the same log powers queues, pub/sub, stream processing, CDC, and event sourcing (the log is the source of truth). "Turn the database inside out" (Kleppmann) is this insight.
Offsets + consumer groups give you decoupling in time and re-readability. A new consumer can replay history; a slow consumer catches up later; a bug fix can reprocess from an earlier offset. That flexibility is impossible with a fire-and-forget queue.

3. How does it work?

Producing and partitioning

Produce(key, value) hashes the key to a partition (PartitionFor), appends, and returns (partition, offset). Same key → same partition → ordered. Keyless records round-robin (max parallelism, no ordering). This choice — what to use as the key — is the architecture decision: order per accountId? then key by account, and accept that one hot account is one hot partition.
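A minimal sketch of key-based partitioning. The choice of FNV-1a and the Log type are assumptions made for illustration (the lab's log.go may hash differently), and the keyless round-robin path is omitted.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// PartitionFor maps a key to one of n partitions (n must be > 0).
// FNV-1a is an assumption; any stable hash gives the same guarantee.
func PartitionFor(key string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(n))
}

// Log is a hypothetical in-memory partitioned log.
type Log struct {
	parts [][]string
}

func NewLog(n int) *Log { return &Log{parts: make([][]string, n)} }

// Produce appends value to the key's partition and returns (partition, offset).
func (l *Log) Produce(key, value string) (int, int) {
	p := PartitionFor(key, len(l.parts))
	l.parts[p] = append(l.parts[p], value)
	return p, len(l.parts[p]) - 1
}

func main() {
	l := NewLog(3)
	for _, v := range []string{"created", "paid", "shipped"} {
		p, off := l.Produce("acct-1", v)
		fmt.Printf("acct-1 %-8s -> partition %d offset %d\n", v, p, off)
	}
}
```

All three records report the same partition with offsets 0, 1, 2; which partition that is depends only on the hash of the key and the partition count.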

Offsets and consuming

Offsets are per-partition, monotonic, and absolute (stable forever). Consume(partition, fromOffset, max) returns records at or after an offset — a consumer polling from where it left off. HighWatermark is the next offset; a consumer is caught up when its commit equals it.
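The polling loop this describes might look like the following sketch. The Partition type and its method signatures are assumptions modelled on the names above, not the lab's actual code.

```go
package main

import "fmt"

// Partition is a hypothetical single append-only partition.
type Partition struct{ records []string }

// Append stores v and returns its offset.
func (p *Partition) Append(v string) int {
	p.records = append(p.records, v)
	return len(p.records) - 1
}

// HighWatermark is the offset the next record will receive.
func (p *Partition) HighWatermark() int { return len(p.records) }

// Consume returns up to max records at or after from. Reading does not
// remove anything, so any number of readers can call it independently.
func (p *Partition) Consume(from, max int) []string {
	if from < 0 {
		from = 0
	}
	if max <= 0 || from >= len(p.records) {
		return nil
	}
	end := from + max
	if end > len(p.records) {
		end = len(p.records)
	}
	return append([]string(nil), p.records[from:end]...)
}

func main() {
	p := &Partition{}
	for _, v := range []string{"a", "b", "c", "d", "e"} {
		p.Append(v)
	}
	next := 0
	for next < p.HighWatermark() {
		batch := p.Consume(next, 2)
		fmt.Println("from", next, "got", batch)
		next += len(batch)
	}
	fmt.Println("caught up:", next == p.HighWatermark())
}
```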

Consumer groups

A consumer group is a set of cooperating consumers that share a topic's partitions: each partition is read by at most one consumer in the group (so within a group, work is divided; across groups, the log is re-read independently). GroupOffsets.Commit/Committed tracks per-(group, partition) progress so a restart resumes correctly. Commit after processing for at-least-once; before for at-most-once.
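Per-(group, partition) progress is just a map keyed by that pair. The signatures below are assumptions built around the Commit/Committed names in the text; the lab's groups.go may differ.

```go
package main

import "fmt"

type slot struct {
	group     string
	partition int
}

// GroupOffsets tracks, per (group, partition), the next offset to read.
type GroupOffsets struct {
	next map[slot]int
}

func NewGroupOffsets() *GroupOffsets {
	return &GroupOffsets{next: make(map[slot]int)}
}

// Commit records that group has finished everything before next on partition.
func (g *GroupOffsets) Commit(group string, partition, next int) {
	g.next[slot{group, partition}] = next
}

// Committed returns where group resumes on partition; a fresh group gets 0.
func (g *GroupOffsets) Committed(group string, partition int) int {
	return g.next[slot{group, partition}]
}

func main() {
	offsets := NewGroupOffsets()
	offsets.Commit("billing", 0, 2) // billing processed offsets 0 and 1

	fmt.Println("billing resumes at", offsets.Committed("billing", 0))
	fmt.Println("analytics resumes at", offsets.Committed("analytics", 0))
}
```

The two groups never touch each other's entry, which is why one can lag or replay without affecting the other.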

Partition assignment & rebalancing

When consumers join/leave, partitions are reassigned. AssignRange (contiguous chunks; earlier consumers get extras) and AssignRoundRobin (one at a time) are Kafka's two classic strategies. Rebalancing is disruptive (consumers stop, reassign, resume) — modern Kafka adds cooperative/sticky rebalancing to minimize movement, the same "minimize reassignment under membership change" goal as gw-04's subsetting and pa-06's consistent hashing.
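Both strategies fit in a few lines each. The signatures are assumptions (partitions numbered 0..n-1, consumers passed in a fixed order); only the AssignRange and AssignRoundRobin names come from the text.

```go
package main

import "fmt"

// AssignRange gives each consumer a contiguous chunk; when the division is
// uneven, earlier consumers receive one extra partition.
func AssignRange(partitions int, consumers []string) map[string][]int {
	out := make(map[string][]int, len(consumers))
	if len(consumers) == 0 {
		return out
	}
	per, extra := partitions/len(consumers), partitions%len(consumers)
	next := 0
	for i, c := range consumers {
		n := per
		if i < extra {
			n++
		}
		for j := 0; j < n; j++ {
			out[c] = append(out[c], next)
			next++
		}
	}
	return out
}

// AssignRoundRobin deals partitions out one at a time.
func AssignRoundRobin(partitions int, consumers []string) map[string][]int {
	out := make(map[string][]int, len(consumers))
	if len(consumers) == 0 {
		return out
	}
	for p := 0; p < partitions; p++ {
		c := consumers[p%len(consumers)]
		out[c] = append(out[c], p)
	}
	return out
}

func main() {
	group := []string{"c1", "c2"}
	fmt.Println("range:      ", AssignRange(5, group))
	fmt.Println("round-robin:", AssignRoundRobin(5, group))
	fmt.Println("4 partitions, 5 consumers:",
		AssignRange(4, []string{"c1", "c2", "c3", "c4", "c5"}))
}
```

In the last call c5 receives nothing, which is the partition-count ceiling on parallelism made visible.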

Retention

Truncate(partition, beforeOffset) drops old records (by offset here; by time/size in production). Offsets of survivors are unchanged, so committed consumer offsets remain valid — a consumer that fell behind the retention window simply resumes at the earliest retained record (and may have lost data, which is the operational risk of too-short retention).

4. Core terminology

Term	Definition
Commit log	Append-only, ordered sequence of records addressed by offset.
Partition	One shard of a topic; the unit of ordering and parallelism.
Offset	A record's monotonic, stable position within its partition.
High watermark	The next offset to be written; "caught up" = committed == HW.
Partition key	The field hashed to choose a partition; decides ordering + skew.
Consumer group	Consumers sharing a topic's partitions (≤1 consumer per partition).
Committed offset	Where a group will resume on a partition.
Rebalancing	Reassigning partitions to consumers when membership changes.
Retention	Dropping old records by time/size; offsets stay absolute.
Replay	Re-reading from an earlier offset (reprocessing).
Hot partition	A skewed key sending disproportionate traffic to one partition.

5. Mental models

A partition is a single-file ledger; the topic is a filing cabinet of them. Each ledger is strictly ordered (append-only); the cabinet parallelizes across ledgers. You only get order within a ledger, so you file related entries (same key) in the same one.
Offsets are bookmarks, not deletions. Reading doesn't consume; many readers keep their own bookmark in the same book. That's why the log supports replay, multiple independent consumers, and reprocessing — unlike a queue where a read pops the message.
The partition key is a routing decision you can't easily undo. It fixes ordering and load distribution. Pick a key that's high-cardinality (avoid hot partitions) and aligned with your ordering needs. Changing partition count later reshuffles the mapping — a migration (gw-12).
Retention is a moving cliff behind your slowest consumer. As long as consumers stay within the window, all is well; fall behind the cliff and you lose data silently. Monitor consumer lag (HW − committed) like an SLI (pa-09).

6. Common misconceptions

"Kafka guarantees global ordering." Only per partition. Across partitions there's no order. If you need order for a key, partition by that key; if you need global order, you need one partition (and you've given up parallelism).
"More partitions is always better." Each partition has a cost: open files, memory, metadata, longer rebalances, and higher end-to-end latency. Size for target throughput and consumer parallelism, not "as many as possible."
"Adding consumers to a group always adds parallelism." Parallelism is capped at the partition count: with 4 partitions, the 5th consumer in a group sits idle. Partitions, not consumers, set the ceiling.
"Reading consumes the message" (queue thinking). In a log, reading advances your offset only; the data stays for other consumers and replay until retention removes it.
"Exactly-once because Kafka transactions." Within Kafka's boundary, yes; but your consumer's external side effects still need idempotency (pa-02/03). Don't conflate broker exactly-once with end-to-end.

7. Interview talking points

"Design a streaming/eventing platform." Partitioned log → key-based partitioning (justify the key: ordering + skew) → consumer groups + offset commit → at-least-once + idempotent consumers (pa-03) → DLQ → retention sized to consumer lag → schema registry for the event contract (pa-02). Name the per-partition ordering guarantee explicitly.
"How do you guarantee ordering?" Only within a partition. Partition by the entity whose order matters (e.g. accountId), accepting that a hot key is a hot partition. Global ordering = one partition = no parallelism; usually you don't actually need it.
"Kafka vs RabbitMQ vs NATS vs Pulsar?" Log (Kafka/Pulsar): durable, replayable, ordered-per-partition, high throughput, consumer groups; suited to event streaming/sourcing. Queue (RabbitMQ): rich routing, per-message ack, competing consumers; suited to task distribution. NATS: lightweight pub/sub + JetStream for streams. Pick by replay/ordering/routing needs.
"How do consumer groups scale and rebalance?" Each partition → one consumer in a group; parallelism caps at partition count. Rebalancing reassigns on membership change (range/round-robin/sticky); sticky/cooperative minimizes movement, the same stability concern as gw-04/pa-06.
"How do you choose a partition count and retention?" Partitions for target throughput + max consumer parallelism (with headroom — changing it is a migration). Retention long enough to cover consumer downtime + replay needs; monitor lag (HW − committed) as an SLI, alert before the retention cliff.

8. Connections to other labs

db-03 (write-ahead log) — the same append-only-log idea you built for crash recovery, here as a distributed messaging primitive.
pa-03 (event-driven) — this log is the durable, ordered, replayable backbone under that bus; ordering is per-partition.
pa-05 (outbox/saga) — the outbox publishes to a log like this; consumers are idempotent because delivery is at-least-once.
pa-06 (partitioning) — key→partition hashing and rebalancing are the same partitioning/stability problems, for data instead of streams.
gw-04 (subsetting) — minimizing reassignment under membership change is the shared theme with consumer-group rebalancing.

pa-04 — The Hitchhiker's Guide to the Partitioned Log

Companion to CONCEPTS.md, with the runnable log in src/go/plog/. pa-03 gave you the event patterns; this is the substrate — the one data structure under Kafka, Pulsar, and NATS JetStream.

bash scripts/verify.sh runs the demo: all acct-1 records share a partition (ordered), a consumer group commits and resumes, partitions get assigned across consumers two ways, and retention truncates while offsets stay absolute.

1. The log + partitioning (log.go)

A PartitionedLog is n append-only slices. Produce(key, value) hashes the key (PartitionFor) to pick a partition, appends, and returns a monotonic offset. The two facts that flow from this:

Same key → same partition → ordered. TestKeyMapsToSamePartition and TestPerPartitionOrdering prove all acct-1 records land together with offsets 0,1,2,… in order. Ordering is only within a partition — the single most important property to internalize.
Offsets are absolute and stable. Consume(partition, fromOffset, max) returns records at/after an offset; HighWatermark is the next offset. A consumer polls from where it left off — reading doesn't consume (unlike a queue), so many consumers and replay coexist.

The architecture decision hiding here is the partition key. Key by accountId and you get per-account ordering but a hot account is a hot partition (skew); key randomly and you get even load but no ordering. That trade-off is yours to own.

2. Consumer groups + offset resume (groups.go)

GroupOffsets tracks per-(group, partition) progress. TestConsumerGroupOffsets walks the lifecycle: a fresh group starts at 0, reads 2, commits next=2, "restarts," and resumes at offset 2 — and a different group reads the same log independently from 0. That independence is the log's superpower: billing and analytics consume the same orders stream at their own pace, and a new consumer can replay history.

Commit timing is a delivery-semantics choice: commit after processing for at-least-once (a crash re-delivers — needs idempotency, pa-03); commit before for at-most-once (a crash loses it). The architect picks per consumer.
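The difference shows up when a crash lands between the two steps. The simulation below is a self-contained sketch (run and its parameters are hypothetical): it processes records from the committed offset, "crashes" once at a chosen offset, then restarts.

```go
package main

import "fmt"

// run processes records starting at *committed and returns the ones whose
// side effect happened. At offset crashAt it crashes between the two steps
// (use -1 for no crash). commitFirst selects commit-then-process
// (at-most-once) instead of process-then-commit (at-least-once).
func run(records []string, committed *int, commitFirst bool, crashAt int) []string {
	var processed []string
	for off := *committed; off < len(records); off++ {
		if commitFirst {
			*committed = off + 1
			if off == crashAt {
				return processed // committed, but never processed
			}
			processed = append(processed, records[off])
		} else {
			processed = append(processed, records[off])
			if off == crashAt {
				return processed // processed, but never committed
			}
			*committed = off + 1
		}
	}
	return processed
}

func main() {
	records := []string{"a", "b", "c"}
	for _, commitFirst := range []bool{false, true} {
		committed := 0
		seen := run(records, &committed, commitFirst, 1)                  // crash at offset 1
		seen = append(seen, run(records, &committed, commitFirst, -1)...) // restart
		fmt.Println("commitFirst:", commitFirst, "processed:", seen)
	}
}
```

Committing after processing yields [a b b c] (b is redelivered, so the consumer must be idempotent); committing first yields [a c] (b is lost).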

3. Assignment & rebalancing (groups.go)

A consumer group splits a topic's partitions across its members. AssignRange (contiguous chunks; earlier consumers get the extra) and AssignRoundRobin (one at a time) are Kafka's classic strategies — TestRangeAssignment/TestRoundRobinAssignment pin both for 5 partitions over 2 consumers. The ceiling: parallelism caps at the partition count — a 5th consumer on a 4-partition topic sits idle. Rebalancing on membership change is disruptive (stop-the-world reassign); production adds sticky/cooperative rebalancing to minimize movement — the exact same stability goal as gw-04's subsetting ring and pa-06's consistent hashing.

4. Retention (log.go)

Truncate(partition, beforeOffset) drops old records, but survivors keep their original absolute offsets (TestRetentionKeepsOffsetsStable: after truncating below 3, the first record is still offset 3, and the high watermark is unchanged). So committed offsets stay valid; a consumer that fell behind the retention window simply resumes at the earliest retained record — having silently lost the truncated data. That's the operational risk: retention is a moving cliff behind your slowest consumer, so you monitor consumer lag (HighWatermark − committed) as an SLI (pa-09) and alert before the cliff.
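The lag arithmetic is small enough to sketch directly. PartitionState, Lag, Lost, and OverThreshold are hypothetical names; in practice the three numbers would come from the log and the group's committed offsets.

```go
package main

import (
	"fmt"
	"sort"
)

// PartitionState is a snapshot of one partition as seen by one group.
type PartitionState struct {
	Earliest      int // oldest retained offset
	HighWatermark int // next offset to be written
	Committed     int // where the group will resume
}

// Lag is how far the group is behind the head of the partition.
func (s PartitionState) Lag() int { return s.HighWatermark - s.Committed }

// Lost is the number of unread records already removed by retention.
func (s PartitionState) Lost() int {
	if s.Committed < s.Earliest {
		return s.Earliest - s.Committed
	}
	return 0
}

// OverThreshold returns, in ascending order, the partitions whose lag
// exceeds threshold.
func OverThreshold(states map[int]PartitionState, threshold int) []int {
	var out []int
	for p, s := range states {
		if s.Lag() > threshold {
			out = append(out, p)
		}
	}
	sort.Ints(out)
	return out
}

func main() {
	states := map[int]PartitionState{
		0: {Earliest: 0, HighWatermark: 100, Committed: 98},
		1: {Earliest: 40, HighWatermark: 120, Committed: 30},
	}
	fmt.Println("alerting partitions:", OverThreshold(states, 50))
	fmt.Println("partition 1 lag:", states[1].Lag(), "lost:", states[1].Lost())
}
```

Partition 1 shows both signals at once: a lag of 90 and 10 records already past the cliff, which is the case the alert is meant to fire before.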

5. Why this is the architect's keystone for eventing

Once the log is concrete, the whole eventing stack composes:

pa-03's pub/sub, at-least-once, and DLQ run on this log.
pa-05's outbox publishes to it; idempotent consumers handle its at-least-once redelivery.
Event sourcing is "the log is the source of truth; state is a fold over it" — replay from offset 0.
Choosing Kafka vs Pulsar vs NATS vs RabbitMQ becomes a concrete comparison of this model (log + groups + retention) vs a queue model.
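The event-sourcing bullet ("state is a fold over the log") can be made concrete with a toy account. The event kinds and the Apply/Replay names are made up for illustration.

```go
package main

import "fmt"

// Event is a hypothetical account event stored in the log.
type Event struct {
	Kind   string // "Deposited" or "Withdrawn"
	Amount int
}

// Apply is the fold step: previous state plus one event gives the next state.
func Apply(balance int, e Event) int {
	switch e.Kind {
	case "Deposited":
		return balance + e.Amount
	case "Withdrawn":
		return balance - e.Amount
	}
	return balance // unknown events leave the state unchanged
}

// Replay rebuilds the state by folding the whole log from offset 0.
func Replay(log []Event) int {
	balance := 0
	for _, e := range log {
		balance = Apply(balance, e)
	}
	return balance
}

func main() {
	log := []Event{
		{Kind: "Deposited", Amount: 100},
		{Kind: "Withdrawn", Amount: 30},
		{Kind: "Deposited", Amount: 5},
	}
	fmt.Println("balance:", Replay(log))
}
```

Nothing stores the balance; it is derived on demand, so fixing a bug in Apply and replaying gives corrected state without a data migration.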

6. Hands-on

cd src/go
bash ../scripts/verify.sh
go run ./cmd/plogsim

7. Exercises

Consumer lag SLI: compute HighWatermark − committed per partition and alert when it exceeds a threshold (wire to pa-09).
Hot-partition detection: produce skewed keys and measure per-partition record counts; show how a bad key creates skew.
Sticky rebalancing: implement an assignor that, when a consumer leaves, reassigns only its partitions and keeps others put; compare movement to range/round-robin (the gw-04/pa-06 stability theme).
Replay: reprocess a partition from offset 0 with a new consumer group; show idempotent consumers (pa-03) make this safe.
Time-based retention: add timestamps and truncate by age; reason about the lag/retention SLO relationship.
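For the sticky-rebalancing exercise, a possible starting point is an assignor that only redistributes the departing consumer's partitions. StickyReassign is a hypothetical helper, far simpler than Kafka's sticky or cooperative assignors, and it handles only the "consumer leaves" case.

```go
package main

import (
	"fmt"
	"sort"
)

// StickyReassign keeps every surviving consumer's partitions in place and
// hands only the orphaned ones to whichever survivor currently holds the
// fewest (ties go to the lexically first name). The input is not modified.
func StickyReassign(current map[string][]int, leaving string) map[string][]int {
	next := make(map[string][]int)
	var survivors []string
	for c, ps := range current {
		if c == leaving {
			continue
		}
		survivors = append(survivors, c)
		next[c] = append([]int(nil), ps...)
	}
	if len(survivors) == 0 {
		return next
	}
	sort.Strings(survivors)
	orphans := append([]int(nil), current[leaving]...)
	sort.Ints(orphans)
	for _, p := range orphans {
		least := survivors[0]
		for _, c := range survivors[1:] {
			if len(next[c]) < len(next[least]) {
				least = c
			}
		}
		next[least] = append(next[least], p)
	}
	return next
}

func main() {
	current := map[string][]int{"a": {0, 1}, "b": {2, 3}, "c": {4, 5}}
	fmt.Println("before:", current)
	fmt.Println("after b leaves:", StickyReassign(current, "b"))
}
```

Only partitions 2 and 3 change owner; re-running range or round-robin over the two survivors would typically move more than that, which is the comparison the exercise asks for.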

pa-04 — References

The log model

Jay Kreps, The Log: What every software engineer should know about real-time data's unifying abstraction — the foundational essay. https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
Apache Kafka docs — topics, partitions, offsets, consumer groups, rebalancing (range/round-robin/sticky/cooperative), retention, delivery semantics. https://kafka.apache.org/documentation/
Designing Data-Intensive Applications (Kleppmann) — ch. 11 (stream processing) and "turning the database inside out."
Kafka: The Definitive Guide (Narkhede et al.).

Contrasting brokers

Apache Pulsar (segmented log + tiered storage), NATS JetStream (subjects + streams), RabbitMQ (queues/exchanges, per-message ack). Know the log-vs-queue distinction and when each fits.

db-03 (write-ahead log) — the same append-only primitive for recovery.
Event sourcing / CQRS (microservices.io) — the log as source of truth.
Sticky/cooperative rebalancing (KIP-429) — minimal-movement assignment, the gw-04/pa-06 stability theme.

Cross-lab links

pa-03 (the patterns that run on this log), pa-05 (outbox → log), pa-06 (partitioning + rebalancing for data), pa-09 (consumer lag as an SLI), gw-04 (minimal-movement reassignment).

pa-04 — Analysis

What the log must get right

Per-partition ordering with monotonic, absolute offsets that never repeat or shift (even after retention).
Key→partition stability: a key always maps to the same partition.
Independent consumer groups with resumable committed offsets.
Assignment that divides partitions across a group (parallelism capped at partition count).
Retention that drops old records without invalidating offsets.

Design decisions

Absolute offsets, prefix-truncation retention. Survivors keep their offsets so committed consumer offsets stay valid across retention — the property that makes "resume where I left off" robust.
Key hashing for partitioning. Deterministic, stateless, and it encodes the ordering guarantee (same key = same partition = ordered). The cost is skew from hot keys — surfaced, not hidden.
Offsets per (group, partition). Groups are independent; reading doesn't consume; replay is just "commit a lower offset." This is the log/queue distinction made concrete.
Two assignors. Range and round-robin match Kafka's classics and let tests pin exact distributions; sticky/cooperative is the production upgrade (minimal movement).

Tradeoffs worth flagging

Ordering vs parallelism. Per-partition ordering means global order needs one partition (no parallelism). Choose the key to order only what must be ordered.
Partition count is sticky. It caps consumer parallelism and, once set, changing it reshuffles key→partition (a migration, gw-12). Size with headroom.
Retention vs replay/lag. Short retention saves storage but risks data loss for slow/restarted consumers; long retention costs storage. Monitor lag (HW − committed) and alert before the cliff (pa-09).
Rebalancing is disruptive. Naive reassignment stops the group; sticky/cooperative minimizes movement — the same concern as gw-04.

What production adds beyond this lab

Durable, replicated storage (segments, leader/follower replication — db-17), not in-memory slices.
Sticky/cooperative rebalancing, transactions/idempotent producers, exactly-once within the broker.
Tiered storage, compaction (key-compacted topics), and time-based retention.
Consumer-lag monitoring + schema registry (pa-02) + end-to-end tracing across the async hop (gw-11).

pa-04 — Execution

Prerequisites

Go ≥ 1.25 (stdlib only, offline).

One-shot

cd pa-04-partitioned-log && bash scripts/verify.sh

Per-language workflow (Go)

cd pa-04-partitioned-log/src/go
go test -race -count=1 ./...      # partitioning, ordering, groups, assignment, retention
go run ./cmd/plogsim

Package map

File	What
`plog/log.go`	partitioned append-only log: produce/consume, offsets, key partitioning, retention
`plog/groups.go`	consumer-group offset tracking + range/round-robin partition assignment
`cmd/plogsim`	partitioning + group resume + assignment + retention demo

See GUIDE.md for the deep dive and the partition-key trade-off.

pa-04 — Verification

One command

cd pa-04-partitioned-log && bash scripts/verify.sh

What the tests prove

Test	Invariant
`TestKeyMapsToSamePartition`	a key always hashes to the same partition (ordering guarantee)
`TestPerPartitionOrdering`	records for a key keep order via monotonic offsets
`TestConsumeFromOffset`	consume returns records at/after an offset; HW is correct
`TestConsumerGroupOffsets`	commit/resume works; groups are independent
`TestRangeAssignment`	5 partitions / 2 consumers → contiguous ranges, earlier gets the extra
`TestRoundRobinAssignment`	5/2 → round-robin distribution
`TestRetentionKeepsOffsetsStable`	truncation drops old records; survivors keep absolute offsets; HW unchanged

All under -race.

What "green" does NOT guarantee

In-memory, single-node. Production replicates partitions (leader/follower, db-17) with durable segments.
No sticky rebalancing / transactions. Range/round-robin only; minimal-movement reassignment is an exercise.
No consumer-lag alerting wired — the SLI is an exercise (pa-09).

pa-05 — Delivery Semantics: Outbox, Idempotency & Sagas

pa-03/04 gave you events and a log; this lab confronts the two hardest problems they create, both squarely in the Apple JD's "fault tolerance" and event-driven scope:

The dual-write problem — a service must update its database and publish an event. Two systems, no shared transaction: a crash between them leaves state and events permanently disagreeing. The transactional outbox solves it.
Distributed transactions across services — a workflow spans orders, payments, and inventory. You can't (and shouldn't) hold a 2PC lock across them. A saga — forward steps plus compensating actions — gives you atomicity-of-outcome without distributed locks.

You build both, with crash recovery, and prove them with tests.

1. What is it?

Delivery semantics describe what happens to a message under failure:

at-most-once:  may lose, never duplicate.
at-least-once: never lose, may duplicate.   ← the practical default
exactly-once:  a myth end-to-end; approximated by at-least-once + idempotency.

The transactional outbox makes a service's state change and its outgoing event atomic: both are written in one local DB transaction (the event goes into an outbox table). A separate relay later reads unpublished outbox rows and publishes them to the broker (pa-04), marking them sent. Because publish-then-mark isn't atomic, the relay is at-least-once — so consumers must be idempotent (pa-02/03).

A saga is a sequence of local transactions, each with a compensating transaction that semantically undoes it. Run forward; if a step fails, run the compensations for completed steps in reverse. Two flavors: orchestration (a central coordinator drives the steps — what we build) and choreography (services react to each other's events — pa-03). Crash recovery requires persisting saga progress so a restart resumes (forward) or completes compensation.

2. Why does it matter?

The dual-write problem is everywhere and silently corrupts data. "Update the order, then publish OrderCreated" looks innocent and is a landmine: a crash in between means downstream never hears about the order (or hears about one that rolled back). Architects must recognize it on sight and reach for the outbox (or change-data-capture). This is a near-certain interview probe.
Distributed 2PC doesn't scale and blocks on coordinator failure. Locking three services' databases together kills availability and throughput. Sagas trade ACID isolation for eventual consistency with guaranteed outcome (committed-or-compensated) — the pragmatic answer for cross-service workflows.
"Exactly-once" claims are a red flag. The senior answer is at-least-once delivery + idempotent consumers = effectively-once. The outbox is the producer side of that; idempotency (pa-02) is the consumer side. Stating this crisply marks experience.
Compensation is a domain-modeling skill. You can't rollback() a shipped package or a sent email; you compensate (refund, recall, apologize). Designing compensations forces clarity about what each step really commits — an architecture-level concern.

3. How does it work?

The transactional outbox

ONE local transaction:                    a separate relay process:
  UPDATE orders SET status='CREATED'        SELECT * FROM outbox WHERE NOT published
  INSERT INTO outbox (event)                FOR each: publish to broker; mark published
                                            (crash between publish & mark -> re-publish)

DB.AtomicWrite(key, val, event) commits both together. Relay.PollOnce publishes unpublished rows and marks them. DualWriteBroken shows the anti-pattern losing an event on a crash. The relay's PollOnceCrashBeforeMark, followed by a normal PollOnce, demonstrates at-least-once re-publication, absorbed by an IdempotentSink. (Change-data-capture — tailing the DB log, e.g. Debezium — is the alternative to a polling relay; same guarantee.)
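The outbox flow above can be sketched as follows. This is an illustrative in-memory model under stated assumptions, not the lab's `delivery/outbox.go`: the `crashBeforeMark` flag on `PollOnce`, the `outboxRow` type, and the `publish` callback are hypothetical, and the mutex stands in for a local DB transaction.

```go
package main

import (
	"fmt"
	"sync"
)

// outboxRow is one pending event (hypothetical shape for this sketch).
type outboxRow struct {
	id        int
	event     string
	published bool
}

// DB is an in-memory stand-in; the mutex plays the role of a local transaction.
type DB struct {
	mu     sync.Mutex
	state  map[string]string
	outbox []outboxRow
}

func NewDB() *DB { return &DB{state: map[string]string{}} }

// AtomicWrite commits the state change and its outbox row in one critical section.
func (d *DB) AtomicWrite(key, val, event string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.state[key] = val
	d.outbox = append(d.outbox, outboxRow{id: len(d.outbox) + 1, event: event})
}

// Relay publishes unpublished outbox rows and then marks them.
type Relay struct {
	db      *DB
	publish func(id int, event string)
}

// PollOnce publishes every unpublished row. With crashBeforeMark set, it
// returns right after the first publish without marking the row, which
// simulates a crash between publish and mark.
// (A real relay would not hold the DB lock while publishing.)
func (r *Relay) PollOnce(crashBeforeMark bool) {
	r.db.mu.Lock()
	defer r.db.mu.Unlock()
	for i := range r.db.outbox {
		row := &r.db.outbox[i]
		if row.published {
			continue
		}
		r.publish(row.id, row.event)
		if crashBeforeMark {
			return
		}
		row.published = true
	}
}

func main() {
	db := NewDB()
	publishes := 0
	relay := &Relay{db: db, publish: func(id int, event string) { publishes++ }}

	db.AtomicWrite("order-1", "CREATED", "OrderCreated")
	relay.PollOnce(true)  // publishes, then "crashes" before marking
	relay.PollOnce(false) // recovery: row still unpublished, so it is re-published
	relay.PollOnce(false) // nothing left to do

	fmt.Println("publishes:", publishes) // publishes: 2
}
```

The event reaches the broker twice and is never lost. The duplicate is the price of not making publish and mark atomic, which is why the consumer side has to deduplicate.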

Idempotent consumers

Because the relay is at-least-once, the consumer dedups by event id (IdempotentSink): a re-published event is delivered once. This is the same effectively-once recipe as pa-02 (idempotency keys), pa-03 (dedup), and gw-05 (push). The outbox guarantees no loss; idempotency guarantees no double-apply.
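A dedup-by-event-id sink might look like the sketch below. The type name `IdempotentSink` comes from the text; its fields, the `Deliver` method, and the in-memory `seen` set are assumptions for illustration (a production consumer would persist the set, ideally in the same transaction as the effect).

```go
package main

import "fmt"

// IdempotentSink applies each event id at most once and counts duplicates.
// Not safe for concurrent use; a sketch only.
type IdempotentSink struct {
	seen       map[string]bool
	Delivered  int
	Duplicates int
}

func NewIdempotentSink() *IdempotentSink {
	return &IdempotentSink{seen: map[string]bool{}}
}

// Deliver reports true if the event was applied, false if it was a duplicate.
func (s *IdempotentSink) Deliver(eventID string) bool {
	if s.seen[eventID] {
		s.Duplicates++
		return false
	}
	s.seen[eventID] = true
	s.Delivered++
	return true
}

func main() {
	sink := NewIdempotentSink()
	sink.Deliver("evt-1") // first publication
	sink.Deliver("evt-1") // re-publication after a relay crash
	fmt.Printf("delivered=%d duplicates=%d\n", sink.Delivered, sink.Duplicates)
	// delivered=1 duplicates=1
}
```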

Sagas (orchestrated, with compensation)

forward:   reserve-inventory → charge-payment → schedule-shipment
on failure at charge-payment:
compensate: (charge not done) → release-inventory   [reverse order]

Saga.Run executes forward; on a step error it runs Undo for completed steps in reverse. The result records what completed and what was compensated. Compensations are semantic undo (refund, not "un-charge").
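A minimal orchestrated saga with reverse compensation could be sketched like this. `Saga.Run` and `Undo` are named in the text; the `Step` and `Result` shapes, `errDeclined`, and the choice to skip a compensation that itself fails are assumptions for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// errDeclined is a sample failure used by the demo.
var errDeclined = errors.New("card declined")

// Step pairs a forward action with its compensating action.
type Step struct {
	Name string
	Do   func() error
	Undo func() error
}

// Result records what completed and what was compensated, for auditing.
type Result struct {
	Completed   []string
	Compensated []string
	Err         error
}

type Saga struct{ Steps []Step }

// Run executes steps in order. On the first failure it runs Undo for every
// completed step in reverse order and returns. A failed Undo is simply left
// out of Compensated here; a real orchestrator would retry and alert.
func (s Saga) Run() Result {
	var res Result
	for i, st := range s.Steps {
		if err := st.Do(); err != nil {
			res.Err = fmt.Errorf("%s: %w", st.Name, err)
			for j := i - 1; j >= 0; j-- {
				if s.Steps[j].Undo() == nil {
					res.Compensated = append(res.Compensated, s.Steps[j].Name)
				}
			}
			return res
		}
		res.Completed = append(res.Completed, st.Name)
	}
	return res
}

func main() {
	ok := func() error { return nil }
	saga := Saga{Steps: []Step{
		{Name: "reserve-inventory", Do: ok, Undo: ok},
		{Name: "charge-payment", Do: func() error { return errDeclined }, Undo: ok},
		{Name: "schedule-shipment", Do: ok, Undo: ok},
	}}
	res := saga.Run()
	fmt.Println("completed:  ", res.Completed)
	fmt.Println("compensated:", res.Compensated)
	fmt.Println("error:      ", res.Err)
}
```

The failed step itself is never compensated, because it did not commit anything; only the steps before it are undone.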

Crash recovery

A real saga persists progress after each step (OnProgress) so a restart can resume. Saga.RunFrom(start) models this: it skips the already-completed prefix and continues — or, on a later failure, compensates everything completed (including the pre-crash steps) in reverse. Without persisted progress, a crash leaves a half-done workflow with no way to know what to undo.

4. Core terminology

Term	Definition
Dual-write problem	Updating a DB and publishing an event non-atomically; a crash between desyncs them.
Transactional outbox	Writing the event into the DB in the same transaction as the state change; a relay publishes it later.
Relay / CDC	The process that publishes outbox rows (polling) or tails the DB log (change-data-capture).
At-least-once	Delivery that never loses but may duplicate (the relay's guarantee).
Idempotent consumer	Dedups duplicates so re-delivery is effectively-once.
Saga	A sequence of local transactions with per-step compensating actions.
Compensation	A semantic undo of a completed step (refund, recall).
Orchestration / choreography	Central coordinator vs services reacting to events.
2PC	Two-phase commit — distributed ACID; blocks on coordinator failure, doesn't scale.
Effectively-once	At-least-once delivery + idempotent consumer.

5. Mental models

The outbox is "put the letter in the same envelope as the deed." You don't sign the deed (commit state) and then separately mail the notification (publish) — a fire between the two loses the notice. Instead you seal both in one envelope (one transaction); a courier (the relay) mails it whenever, even if you've gone home.
A saga is checkout with a returns desk, not an escrow lock. 2PC freezes everyone's money in escrow until all agree (slow, blocking). A saga lets each shop complete its sale, and if a later shop declines, you walk the receipts back and get refunds (compensate). Eventually consistent, never blocking.
You can't undo, you can only compensate. Time only runs forward. "Un-ship" isn't a thing; "issue a recall + refund" is. Modeling the compensation forces you to admit what a step truly commits in the real world.
At-least-once + idempotency = effectively-once. Repeat it until it's reflex. The outbox guarantees the event is eventually delivered (≥1 times); the idempotent consumer guarantees the effect happens once. Neither alone is enough.

6. Common misconceptions

"Just update the DB and publish — what could go wrong?" A crash between them. It's the dual-write problem and it will happen at scale. Use the outbox or CDC; never best-effort dual writes for events that matter.
"Use a distributed transaction (2PC) across services." It blocks on coordinator failure, couples availability, and doesn't scale. Sagas are the answer for cross-service workflows; reserve 2PC for tightly-coupled resources within one boundary, if ever.
"Sagas give you rollback." They give you compensation, which is not the same: there's a window where partial effects are visible (someone saw the inventory reserved), and compensations can themselves fail (needing retries/alerting). Sagas are eventual consistency, not isolation.
"The outbox gives exactly-once." It gives at-least-once publication. You still need idempotent consumers. The outbox solves loss, not duplication.
"Choreography is always simpler than orchestration." For 2-3 steps, yes; for complex workflows with compensation and visibility needs, choreography scatters the logic across event handlers and becomes un-debuggable. Orchestration centralizes the workflow (at the cost of a coordinator). Trade-off, not dogma.

7. Interview talking points

"A service updates its DB and publishes an event — what's wrong?" The dual-write problem: non-atomic, so a crash desyncs state and events. Fix with the transactional outbox (event in the same DB transaction; a relay or CDC publishes it) — at-least-once, so consumers are idempotent. This is the canonical answer; have it instant.
"How do you do a transaction across three services?" You don't — not 2PC. A saga: local transactions + compensating actions, orchestrated or choreographed, persisting progress for crash recovery. Accept eventual consistency (committed-or-compensated) and design the compensations.
"Exactly-once?" Doesn't exist end-to-end. At-least-once delivery + idempotent consumers = effectively-once. Outbox (no loss) + idempotency (no double-apply).
"Orchestration vs choreography for a saga?" Orchestration: a coordinator drives steps — visible, controllable, a central component. Choreography: services react to events — decoupled, but the workflow is emergent and hard to see/debug. Pick by workflow complexity and observability needs.
"What happens if a compensation fails?" It must be retryable (idempotent) and alert/escalate if it can't complete — there's no further automatic recovery, so it becomes an operational/SLO concern (pa-09). Design compensations to be as reliable as the forward steps.
"Outbox relay vs change-data-capture?" Polling the outbox table is simple and DB-agnostic; CDC (tailing the DB write-ahead log, e.g. Debezium) avoids polling latency and load but couples to the DB's log. Same at-least-once guarantee; pick by latency and operational fit.

8. Connections to other labs

pa-02 (idempotency keys) — the consumer side that makes the outbox's at-least-once safe.
pa-03 (event-driven) — the outbox publishes to that bus; sagas are orchestrated event workflows; choreography is the event-reaction alternative.
pa-04 (log) — the relay publishes to a partitioned log; CDC tails a log just like it (db-03).
pa-09 (reliability) — failed compensations and DLQ'd events become SLO/alerting concerns.
gw-05 (push) — at-least-once + dedup at the edge; db-13/16 — transactions and the distributed-consistency fundamentals sagas relax.

pa-05 — The Hitchhiker's Guide to Outbox & Sagas

Companion to CONCEPTS.md, with the runnable patterns in src/go/delivery/. Two of the most-asked distributed-systems interview questions live here: the dual-write problem and cross-service transactions.

bash scripts/verify.sh runs the demo, and it's the whole lesson in two blocks:

broken dual write + crash:     state=CREATED, events published=0   <- LOST EVENT
outbox + crash + recovery:     delivered=1 duplicates=1            <- effectively once, no loss

saga: payment fails -> compensate in reverse
  [do]   reserve-inventory
  [do]   charge-payment FAILED
  [undo] reserve-inventory

1. The dual-write problem and the outbox (outbox.go)

A service updates its DB and publishes an event. These are two systems with no shared transaction. DualWriteBroken shows the failure: AtomicWrite-then-publish with a crash in between leaves state=CREATED but published=0 — the order exists, downstream never hears, permanent inconsistency. TestOutboxNoLostEventsVsBrokenDualWrite asserts that divergence.

The fix is AtomicWrite(key, val, event): the event is written into the DB in the same transaction as the state change, so they can never disagree (StateCount == OutboxCount, always). A separate relay (PollOnce) publishes unpublished outbox rows and marks them sent. Because publish-then-mark isn't atomic, the relay is at-least-once — TestOutboxAtomicAndAtLeastOnce simulates a crash after publish (PollOnceCrashBeforeMark), then recovery re-publishes, and the IdempotentSink absorbs the duplicate: delivered=1, duplicates=1. The outbox guarantees no loss; idempotency guarantees no double-apply; together = effectively-once.

Production note: a polling relay is the simple version; change-data-capture (tailing the DB's write-ahead log, e.g. Debezium) is the low-latency version. Same guarantee — it's db-03's WAL, read by a publisher.

2. Sagas: transactions without 2PC (saga.go)

You can't hold a 2PC lock across orders + payments + inventory (it blocks on coordinator failure and doesn't scale). A saga runs local transactions with compensating actions. Saga.Run executes forward; on a failure it runs Undo for completed steps in reverse order. TestSagaCompensatesInReverse: reserve OK, charge OK, ship FAILS → compensate [charge, reserve] (reverse). TestSagaHappyPath: all succeed, nothing compensated.

The architecture insight: you compensate, you don't roll back. You can't un-ship a package; you issue a refund/recall. Modeling each step's compensation forces clarity about what it really commits — and there's a visible window where partial effects exist (someone saw the inventory reserved before the refund). Sagas buy eventual consistency (committed-or-compensated), not isolation.

3. Crash recovery (saga.go)

A real orchestrator persists progress after each step (the OnProgress hook) so a restart knows where it was. Saga.RunFrom(start) models resumption: TestSagaResumeAfterCrash "crashes" after 2 forward steps and resumes from index 2 — step c runs once, steps a/b do not re-run. Without persisted progress, a crashed saga is a half-done workflow with no way to know what to undo — the difference between a recoverable system and a manual-cleanup incident.

4. The trade-offs an architect owns

Outbox vs CDC vs best-effort. Best-effort dual writes lose events — never use them for events that matter. Outbox (polling) is simple; CDC is lower-latency. Both are at-least-once.
Saga vs 2PC. 2PC for tightly-coupled resources in one boundary (if ever); sagas for cross-service workflows. Sagas trade isolation for availability + scale.
Orchestration vs choreography. Central coordinator (visible, controllable) vs event-reaction (decoupled, emergent). Complexity of the workflow decides.
Compensation reliability. A failed compensation has no automatic recovery — it's an SLO/alert concern (pa-09). Make compensations idempotent and as reliable as the forward steps.

5. Hands-on

cd src/go
bash ../../scripts/verify.sh
go run ./cmd/sagasim

6. Exercises

Publish to the real log: make the relay sink a pa-04 PartitionedLog; have an idempotent consumer (pa-03) process it.
CDC instead of polling: model a relay that tails an append-only log (db-03) of committed writes rather than polling a table.
Choreographed saga: rebuild the order saga as services reacting to each other's events (pa-03), then compare debuggability to this orchestrated version.
Compensation failure: make a compensation fail and add retry-with-backoff + an alert when it can't complete (pa-09).
Persisted orchestrator: store OnProgress to a file/DB and prove a process restart resumes the exact saga via RunFrom.

pa-05 — References

Outbox & dual-write

Chris Richardson, microservices.io — Transactional Outbox, Polling Publisher, Transaction Log Tailing. https://microservices.io/patterns/data/transactional-outbox.html
Confluent — Transactional Outbox with Kafka + Debezium (CDC).
Designing Data-Intensive Applications (Kleppmann) — the dual-write problem; change-data-capture; deriving streams from databases.
Debezium docs — change-data-capture (the CDC alternative to polling).

Sagas & distributed transactions

Garcia-Molina & Salem, Sagas (1987) — the original paper.
Chris Richardson — Saga pattern (orchestration vs choreography). https://microservices.io/patterns/data/saga.html
Caitie McCaffrey, Distributed Sagas (talk) — sagas done right.
Pat Helland, Life Beyond Distributed Transactions — why 2PC doesn't scale and what to do instead.

Delivery semantics

"You Cannot Have Exactly-Once Delivery" (Tyler Treat).
Kafka transactions / idempotent producer docs (exactly-once within the broker boundary).

Cross-lab links

pa-02 (idempotency keys), pa-03 (events/choreography), pa-04 (the log the relay publishes to), pa-09 (failed-compensation as an SLO concern), db-03 (WAL = what CDC tails), db-13/16 (transactions & consistency sagas relax).

pa-05 — Analysis

What the patterns must get right

Atomic state+event (outbox): they can never diverge (StateCount == OutboxCount).
At-least-once relay: a crash between publish and mark re-publishes; the idempotent consumer makes it effectively-once.
Saga compensation in reverse for exactly the completed steps.
Crash recovery: resume forward from persisted progress, or compensate everything completed on failure.

Design decisions

Event in the same transaction as state. The whole point: one commit, no dual write. The relay is decoupled and may be at-least-once.
Don't make publish+mark atomic. Forcing that would recreate a distributed transaction; instead embrace at-least-once + idempotency — the honest, scalable choice.
Compensation is semantic undo, recorded in reverse. The result tracks completed-forward and compensated lists so the outcome is auditable.
RunFrom for resumption. Progress is an index persisted via a hook; resume skips the completed prefix. This models a persisted orchestrator without a real datastore.

Tradeoffs worth flagging

Outbox latency vs CDC complexity. Polling adds latency + DB load; CDC removes the poll but couples to the DB log and adds a connector to operate. Same guarantee.
Saga = eventual consistency, not isolation. Partial effects are visible during the workflow; consumers must tolerate "reserved but not yet charged" states. That's the cost of avoiding 2PC.
Compensations can fail. No automatic recovery beyond retries; a stuck compensation is an operational/SLO event (pa-09). Design them idempotent and reliable.
Orchestration adds a component. The coordinator is a service to build, scale, and make HA; choreography avoids it but scatters the workflow.

What production adds beyond this lab

A real DB transaction + CDC (Debezium) or a robust polling publisher.
A durable, replicated orchestrator (state machine persisted per saga) with retries, timeouts, and alerting on stuck sagas.
Idempotent consumers + a schema registry (pa-02/03) on the published events; DLQ + replay for poison events (pa-03).
End-to-end tracing across the async hops (gw-11).

pa-05 — Execution

Prerequisites

Go ≥ 1.25 (stdlib only, offline).

One-shot

cd pa-05-delivery-semantics-saga && bash scripts/verify.sh

Per-language workflow (Go)

cd pa-05-delivery-semantics-saga/src/go
go test -race -count=1 ./...      # outbox at-least-once, saga compensation + resume
go run ./cmd/sagasim

Package map

File	What
`delivery/outbox.go`	atomic state+event write, relay (at-least-once), idempotent sink, broken-dual-write contrast
`delivery/saga.go`	orchestrated saga: forward steps + reverse compensation + RunFrom (crash resume)
`cmd/sagasim`	outbox vs broken dual write; saga happy-path + compensation

See GUIDE.md for the deep dive.

pa-05 — Verification

One command

cd pa-05-delivery-semantics-saga && bash scripts/verify.sh

What the tests prove

Test	Invariant
`TestOutboxAtomicAndAtLeastOnce`	crash-before-mark re-publishes; idempotent sink → effectively-once; no unpublished left
`TestOutboxNoLostEventsVsBrokenDualWrite`	broken dual write loses the event; outbox keeps state↔event 1:1
`TestSagaHappyPath`	all forward steps run; nothing compensated
`TestSagaCompensatesInReverse`	a mid-saga failure compensates completed steps in reverse order
`TestSagaResumeAfterCrash`	resuming from persisted progress runs only remaining steps

All under -race.

What "green" does NOT guarantee

No real DB/CDC. The "transaction" is a mutex; production uses a DB transaction + Debezium or a robust polling publisher.
No durable orchestrator. RunFrom models resume; a real saga persists per-saga state with retries/timeouts/alerting.
Compensation reliability is on you. Failed compensations need retries + alerting (pa-09); not handled here.

pa-06 — Partitioning Strategies & Consistency Models

The Apple JD asks for a "strong understanding of distributed systems fundamentals: consistency models, fault tolerance, and partitioning strategies." This lab makes two of the three concrete and runnable (fault tolerance threads through the whole book): how you split data across nodes (partitioning) and what guarantees you get when you replicate it (consistency / quorums). Consensus, the strong-consistency counterpart to the tunable quorums here, you already built in db-16…20.

You build consistent hashing (with virtual nodes) and measure that it moves ~1/N of keys on a membership change versus mod-N's near-total reshuffle; and a quorum model that demonstrates the R+W>N overlap rule that separates strong from eventual consistency.

1. What is it?

Partitioning (sharding) splits a dataset across nodes so it scales beyond one machine and one machine's failure. The strategies:

Range — contiguous key ranges per node (great for range scans; risk of hotspots at the ends).
Hash — node = hash(key) % N (even spread; kills range scans; catastrophic resharding when N changes — almost every key moves).
Consistent hashing — nodes and keys placed on a hash ring; a key belongs to the next node clockwise. Adding/removing a node moves only the keys near it (~1/N), not everything. Virtual nodes (many ring positions per physical node) fix balance and smooth movement.

Replication copies each partition to N nodes (the replication factor) for fault tolerance. Consistency models describe what a reader may observe:

strong / linearizable: every read sees the latest committed write (as if one copy).
sequential / causal:   reads respect some/causal order, not necessarily real-time.
read-your-writes:      you always see your own writes.
eventual:              replicas converge "eventually"; reads may be stale.

Quorums tune the strong↔eventual dial: with N replicas, write to W, read from R; R+W>N guarantees the read and write sets overlap, so the read sees the latest write (strong). R+W≤N allows staleness (eventual), with higher availability/lower latency.

2. Why does it matter?

Partitioning decides scalability and hotspots. The partition key (like pa-04's) determines whether load spreads evenly or one shard melts. And the strategy decides what a resharding event costs: mod-N resharding is a full data shuffle (an outage-grade migration); consistent hashing makes it a ~1/N background rebalance. Architects choose this; getting it wrong is a rewrite.
Consistency is a per-dataset, per-operation decision with real cost. Strong consistency costs latency and availability (CAP/PACELC); eventual buys both back. The architect's job is to pick the weakest model each dataset can tolerate — strong for balances, eventual for a "likes" counter — not to blanket "strong everywhere."
The R+W>N knob is the practical face of CAP. Dynamo-style systems (Cassandra, Riak, DynamoDB) expose exactly W and R so you tune consistency vs latency/availability per query. Knowing the overlap math cold is a fundamentals checkpoint.
Minimal-movement-under-change is a recurring architecture theme. Consistent hashing (here), consumer-group rebalancing (pa-04), and gateway subsetting (gw-04) all solve "redistribute work when membership changes without moving everything." Recognizing the shared pattern is senior-level synthesis.

3. How does it work?

Consistent hashing with virtual nodes

hash ring (0 .. 2^32):
   a#3        b#1     c#2      a#1        c#5     b#4
   ──●─────────●───────●────────●──────────●───────●──▶ (wraps)
   key k hashes here ──┘ belongs to the next vnode clockwise -> a#1 -> node a

Ring.AddNode places a node at vnodes positions; Get(key) finds the first position clockwise (binary search + wrap). Removing a node (RemoveNode) deletes its positions; only keys that pointed at those positions move — to the next node clockwise. Measured: removing 1 of 3 nodes moves ~32% of keys; mod-N moves ~67%. Virtual nodes (many per physical node) make the per-node load even and the movement smooth (a removed node's keys spread across all survivors, not dumped on one neighbor).
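The ring described above can be sketched as follows. `AddNode`, `RemoveNode`, and `Get` are named in the text; the FNV-1a hash, the `node#i` vnode labels, and the collision handling are assumptions for this example, and the measured movement will depend on the hash and key set, so no percentage is claimed here.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring is a consistent-hash ring with virtual nodes.
type Ring struct {
	vnodes int
	points []uint32          // sorted vnode positions
	owner  map[uint32]string // position -> physical node
}

func NewRing(vnodes int) *Ring {
	return &Ring{vnodes: vnodes, owner: map[uint32]string{}}
}

// hash32 is FNV-1a, chosen only because it is in the standard library.
func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// AddNode places the node at r.vnodes positions on the ring.
func (r *Ring) AddNode(node string) {
	for i := 0; i < r.vnodes; i++ {
		p := hash32(fmt.Sprintf("%s#%d", node, i))
		if _, taken := r.owner[p]; taken {
			continue // rare collision: the existing owner keeps the position
		}
		r.owner[p] = node
		r.points = append(r.points, p)
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
}

// RemoveNode deletes all of the node's positions; the slice stays sorted.
func (r *Ring) RemoveNode(node string) {
	kept := r.points[:0]
	for _, p := range r.points {
		if r.owner[p] == node {
			delete(r.owner, p)
			continue
		}
		kept = append(kept, p)
	}
	r.points = kept
}

// Get returns the owner of the first position clockwise from the key's hash.
func (r *Ring) Get(key string) string {
	if len(r.points) == 0 {
		return ""
	}
	h := hash32(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	ring := NewRing(200)
	for _, n := range []string{"a", "b", "c"} {
		ring.AddNode(n)
	}
	before := map[string]string{}
	for i := 0; i < 10000; i++ {
		k := fmt.Sprintf("key-%d", i)
		before[k] = ring.Get(k)
	}
	ring.RemoveNode("c")
	moved := 0
	for k, old := range before {
		if ring.Get(k) != old {
			moved++
		}
	}
	fmt.Printf("moved %d of %d keys\n", moved, len(before))
}
```

The structural guarantee is independent of the hash quality: a key whose owner was not the removed node keeps its owner, because the first position clockwise from it is unchanged. Hash quality and vnode count only affect how evenly the load is spread.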

Quorums and the overlap guarantee

N=3 replicas.  Write W, Read R.
  R+W>N  -> read set ∩ write set ≠ ∅ (pigeonhole) -> read sees latest = STRONG
  R+W<=N -> sets can be disjoint -> read may be STALE = EVENTUAL

Cluster.Write(W, version) updates W replicas; ReadWorstCase(R) reads the R replicas least likely to hold the write — so if even that read sees the new version, overlap is guaranteed. QuorumOverlaps(N,W,R) returns W+R>N. Tunings: W=N,R=1 (fast reads, slow/availability-sensitive writes); W=1,R=N (fast writes, slow reads); W=R=quorum (balanced strong, the common default).
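The worst-case quorum read can be modeled in a few lines. `Cluster.Write`, `ReadWorstCase`, and `QuorumOverlaps` are named in the text; representing each replica as an integer version, writing to the first W replicas, and reading the last R are assumptions chosen to make the adversarial case explicit.

```go
package main

import "fmt"

// Cluster holds the version stored on each of N replicas.
type Cluster struct{ replicas []int }

// NewCluster creates n replicas that all hold the initial version.
func NewCluster(n, initial int) *Cluster {
	c := &Cluster{replicas: make([]int, n)}
	for i := range c.replicas {
		c.replicas[i] = initial
	}
	return c
}

// Write stores version on the first w replicas (w is clamped to N).
func (c *Cluster) Write(w, version int) {
	if w > len(c.replicas) {
		w = len(c.replicas)
	}
	for i := 0; i < w; i++ {
		c.replicas[i] = version
	}
}

// ReadWorstCase reads the last r replicas, the ones Write reaches last,
// and returns the highest version among them.
func (c *Cluster) ReadWorstCase(r int) int {
	n := len(c.replicas)
	if r > n {
		r = n
	}
	latest := 0
	for i := n - r; i < n; i++ {
		if c.replicas[i] > latest {
			latest = c.replicas[i]
		}
	}
	return latest
}

// QuorumOverlaps reports whether any read set must intersect any write set.
func QuorumOverlaps(n, w, r int) bool { return w+r > n }

func main() {
	for _, q := range []struct{ w, r int }{{2, 2}, {1, 1}} {
		c := NewCluster(3, 1) // every replica starts at v1
		c.Write(q.w, 2)       // write v2 to W replicas
		fmt.Printf("W=%d R=%d overlap=%v read=v%d\n",
			q.w, q.r, QuorumOverlaps(3, q.w, q.r), c.ReadWorstCase(q.r))
	}
	// W=2 R=2 overlap=true read=v2
	// W=1 R=1 overlap=false read=v1
}
```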

CAP / PACELC (the trade-off you're tuning)

CAP: under a network Partition you must choose Consistency or Availability. R+W>N with a partition means some reads/writes can't reach a quorum → you sacrifice availability to keep consistency (or relax the quorum to stay available, sacrificing consistency).
PACELC: even when there's no partition (Else), you trade Latency vs Consistency (a strong read waits for a quorum). So consistency costs latency always, not just during partitions.

4. Core terminology

Term	Definition
Partitioning / sharding	Splitting data across nodes for scale + fault isolation.
Range / hash / consistent hashing	Contiguous ranges / `hash%N` / hash-ring placement.
Virtual nodes	Multiple ring positions per physical node; even load + smooth movement.
Rebalancing	Redistributing partitions when membership changes (minimize movement!).
Hotspot / skew	A partition receiving disproportionate load (bad key choice).
Replication factor	How many copies of each partition exist.
Quorum (N/W/R)	Replica count / write-acks / read-replicas.
R+W>N	The overlap condition for strong (read-your-writes) consistency.
Linearizable / causal / eventual	Strongest → weakest consistency models.
CAP / PACELC	Consistency vs availability (under partition) / vs latency (else).

5. Mental models

Consistent hashing is musical chairs with assigned seats around a circle. Each key sits in the next occupied chair clockwise. Remove a chair (node) and only the people who were sitting there get up and move to the next chair — everyone else stays put. Mod-N is "renumber every seat," so the whole room shuffles.
Virtual nodes are giving each player many chairs. One chair per player means uneven gaps (some players hog huge arcs) and a removed player dumps their whole arc on one neighbor. Scatter each player's many chairs around the ring and load evens out and a departure spreads across everyone.
R+W>N is the overlapping-Venn-diagrams rule. Draw the W replicas you wrote and the R you read as two sets in a universe of N. If their sizes sum to more than N, they must intersect (pigeonhole) — and that shared replica has the latest write. Shrink them so they fit side-by-side (R+W≤N) and a read can miss the write.
Consistency is a dial, not a switch. Per dataset, pick the weakest model that's still correct: strong for money, causal for comments, eventual for view counts. "Strong everywhere" is paying latency and availability tax you usually don't need.

6. Common misconceptions

"Hash partitioning (mod-N) is fine." Until you add a node — then ~all keys remap and you move the whole dataset (an outage). Consistent hashing (or a fixed large partition count, Kafka-style) is the scalable choice.
"More replicas = stronger consistency." Replication is for fault tolerance/availability; consistency comes from the W/R quorum, not the replica count. You can have many replicas and still read stale data if R+W≤N.
"Strong consistency everywhere." It costs latency on every read (PACELC) and availability under partition (CAP). Most data tolerates weaker models; forcing strong everywhere is over-engineering that hurts p99 and uptime.
"Consistent hashing balances perfectly." Only with enough virtual nodes. One position per node gives lumpy arcs and a bad departure dumps load on one neighbor. Vnodes (100s) are what make it even.
"CAP means pick 2 of 3." The modern reading: partitions will happen, so you're really choosing C or A during a partition, and (PACELC) C vs L the rest of the time. It's a per-operation dial, not a one-time architecture badge.

7. Interview talking points

"How do you partition data?" Strategy by access pattern: range for scans (watch hotspots), hash for even spread (but mod-N reshards badly), consistent hashing with vnodes for elastic membership (~1/N movement). Choose the partition key to avoid skew. Name the resharding cost of each.
"Explain R+W>N." With N replicas, writing W and reading R, if R+W>N the read and write quorums overlap (pigeonhole) so the read sees the latest write — strong/read-your-writes. R+W≤N permits staleness (eventual) for better latency/availability. Give example tunings (W=N/R=1, W=1/R=N, W=R=quorum).
"Consistency models, weakest you can use?" Linearizable (money) → causal (chat/comments) → read-your-writes (your own profile) → eventual (view counts/likes). Pick per dataset; strong everywhere is a latency/availability tax.
"CAP vs PACELC?" CAP: under a partition, choose C or A. PACELC: even without a partition, choose latency or consistency. So consistency costs you always, and the architect decides where it's worth it.
"Why consistent hashing over mod-N?" Membership changes: mod-N remaps ~all keys (full reshuffle = migration/outage); consistent hashing moves ~1/N. Vnodes give even load + smooth departures. Same minimal-movement goal as Kafka consumer rebalancing (pa-04) and gateway subsetting (gw-04).
"How do you keep stale replicas converging?" Read-repair (fix stale replicas seen during a read), anti-entropy (background Merkle-tree sync), and hinted handoff (store writes for a down replica) — the Dynamo toolkit on top of quorums.

8. Connections to other labs

db-16…20 (consensus) — the third fundamental (strong consistency via a replicated log); quorums here are the tunable, Dynamo-style alternative to full consensus.
pa-04 (log) — partitioning + rebalancing for streams; the same minimal-movement concern.
gw-04 (subsetting) — the Van der Corput ring is consistent-hashing-adjacent: balanced, minimal-movement assignment under membership churn.
pa-09 (reliability) — consistency/availability choices are SLO decisions; quorum tuning is a reliability lever.
db-13 (MVCC) — single-node consistency; this lab is its distributed, replicated counterpart.

pa-06 — The Hitchhiker's Guide to Partitioning & Consistency

Companion to CONCEPTS.md, with the runnable code in src/go/partition/. Two distributed-systems fundamentals the JD names, made measurable.

bash scripts/verify.sh runs the demo:

key movement when removing 1 of 3 nodes:
  consistent hashing: 32% of keys moved
  mod-N hashing:      67% of keys moved  <- reshuffles almost everything
quorum overlap (N=3): R+W>N => read sees latest write
  W=2 R=2  overlap=true   read=v2  STRONG
  W=1 R=1  overlap=false  read=v1  STALE

Two numbers and a table that encode the whole lab.

1. Consistent hashing vs mod-N (ring.go)

The headline distributed-systems result every architect should be able to derive: how much data moves when the cluster changes size.

ModHashGet is the naive node = sortedNodes[hash%N]. It's perfectly even — until N changes, when the modulus changes and almost every key remaps. TestModHashMovesEverything measures ~67% moved going 3→2 nodes. That's a full data shuffle = an outage-grade migration on every scaling event.

Ring places each node at vnodes positions on a hash circle; Get finds the next position clockwise. Removing a node deletes its positions, so only the keys that pointed there move — to the next node clockwise. TestConsistentHashingMinimalMovement measures ~32% (≈ the departed node's 1/3 share) and TestConsistentHashingBalance confirms 200 vnodes keep each node within ~25% of the ideal load. The vnodes are what make it both balanced (no lumpy arcs) and smooth (a departure spreads across all survivors, not dumped on one neighbor).

This is the same minimal-movement-under-membership-change problem as Kafka consumer rebalancing (pa-04) and gateway subsetting (gw-04) — recognizing the shared pattern is the synthesis an architect brings.

2. Quorums and R+W>N (quorum.go)

Replication gives fault tolerance; quorums give consistency. With N replicas, a write must reach W of them and a read consults R of them. Cluster.Write(W, version) updates W replicas; ReadWorstCase(R) reads the R replicas least likely to have the write (the adversarial staleness case). The result:

TestQuorumStrongWhenOverlap: W=2,R=2,N=3 → R+W>N → the read sees v2 (strong / read-your-writes).
TestQuorumStaleWhenNoOverlap: W=1,R=1,N=3 → R+W≤N → the read can see v1 (stale / eventual).

The proof is pigeonhole: if W+R>N, the write set and read set can't be disjoint within N replicas, so they share at least one replica — and it holds the latest write. This single inequality is the dial between strong and eventual consistency, exposed directly by Dynamo-style systems (Cassandra, DynamoDB) as tunable per query.
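A minimal sketch of that overlap argument, assuming a model where the write lands on the first W replicas and the worst-case read picks the last R. The method names mirror the ones above, but the internals (a slice of version numbers, the NewCluster constructor) are assumptions for illustration, not the lab's quorum.go. The two sets share at least W+R-N replicas, so the read is fresh exactly when that number is positive.

```go
package main

import "fmt"

// Cluster models N replicas, each holding only a version number.
type Cluster struct{ versions []int }

// NewCluster creates n replicas that all start at the same version.
func NewCluster(n, initial int) *Cluster {
	c := &Cluster{versions: make([]int, n)}
	for i := range c.versions {
		c.versions[i] = initial
	}
	return c
}

// Write updates the first w replicas to version v; the rest stay stale.
func (c *Cluster) Write(w, v int) {
	for i := 0; i < w && i < len(c.versions); i++ {
		c.versions[i] = v
	}
}

// ReadWorstCase reads the last r replicas (the ones a write reaches last)
// and returns the newest version among them.
func (c *Cluster) ReadWorstCase(r int) int {
	start := len(c.versions) - r
	if start < 0 {
		start = 0
	}
	best := 0
	for _, v := range c.versions[start:] {
		if v > best {
			best = v
		}
	}
	return best
}

func main() {
	const n = 3
	for _, q := range [][2]int{{2, 2}, {1, 1}} {
		w, r := q[0], q[1]
		c := NewCluster(n, 1) // every replica starts at v1
		c.Write(w, 2)         // write v2 to W replicas
		fmt.Printf("W=%d R=%d overlap=%v read=v%d\n", w, r, w+r > n, c.ReadWorstCase(r))
	}
}
```

Because the read deliberately targets the replicas the write reaches last, a fresh result here means the overlap was forced by the inequality, not by luck in replica selection.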

3. The architect's decisions

The code is small; the decisions are the seniority:

Partition key — decides skew (hot shards) and what's co-located. Same weight as pa-04's key choice.
Partition strategy — range (scans, hotspots), hash (even, bad reshard), consistent hashing (elastic). The reshard cost is the deciding factor.
Consistency per dataset — the weakest model that's still correct. Strong for balances (W=R=quorum), eventual for counters (W=1,R=1). Not "strong everywhere."
CAP/PACELC posture — under a partition, C or A; otherwise, L or C. Consistency costs latency always (a strong read waits for a quorum), so spend it deliberately.
Convergence — eventual systems need read-repair, anti-entropy, and hinted handoff to actually converge; quorums alone don't heal stale replicas.

4. Hands-on

cd src/go
bash ../scripts/verify.sh
go run ./cmd/partsim

Vary the vnode count and watch balance/movement change; vary N/W/R and watch the overlap flip.

5. Exercises

Vnode sweep: plot load imbalance (max/min per node) vs vnode count (1, 10, 100, 1000) and find where it flattens.
Weighted nodes: give a bigger node more vnodes and show it takes proportionally more load (heterogeneous clusters).
Replication on the ring: store each key on the next N distinct physical nodes clockwise (Dynamo-style preference lists, where N is the replication factor), then layer the W/R quorum on top.
Read-repair: when ReadWorstCase sees a newer version on some replica, write it back to the stale ones; show convergence.
CAP under partition: split the replicas into two groups and show that a quorum write fails on the minority side (choosing C over A).

pa-06 — References

Foundational

Designing Data-Intensive Applications (Kleppmann) — ch. 5 (replication), ch. 6 (partitioning), ch. 9 (consistency/consensus). The single best treatment.
Dynamo: Amazon's Highly Available Key-value Store (DeCandia et al., 2007) — consistent hashing + vnodes, quorums (N/W/R), read-repair, hinted handoff, anti-entropy.
Karger et al., Consistent Hashing and Random Trees (1997) — the original.

Consistency & CAP

Brewer, CAP; Abadi, PACELC (consistency vs latency, always).
Vogels, Eventually Consistent; Bailis, Highly Available Transactions / the "Consistency without consensus" line.
Jepsen analyses (Cassandra, etcd, Mongo) — consistency claims tested under partition. https://jepsen.io/analyses

Systems to study

Cassandra / ScyllaDB (tunable quorums, vnodes), DynamoDB, Riak — Dynamo-style. etcd / Spanner / CockroachDB — consensus-based strong consistency (db-17).

Cross-lab links

db-16…20 (consensus = the strong-consistency leg), pa-04 (stream partitioning + rebalancing), gw-04 (subsetting ring), pa-09 (consistency/availability as SLO choices).

pa-06 — Analysis

What the code must get right

Minimal movement: removing 1 of N nodes moves ~1/N keys on the ring; mod-N moves ~all.
Balance: enough vnodes keep per-node load near the ideal.
Quorum overlap: R+W>N ⇒ a (worst-case) read sees the latest write; R+W≤N ⇒ it may be stale.

Design decisions

Vnodes per physical node. One position per node gives lumpy arcs and dumps a departed node's load on one neighbor; many positions even the load and spread departures. The test uses 200.
Clockwise-next ownership + binary search. Standard consistent-hashing lookup; deterministic and O(log positions).
Worst-case read for the quorum demo. Reading the replicas least likely to hold the write makes the overlap guarantee unambiguous: if even that read is fresh, R+W>N held.
Versions, not values. Consistency is about recency; a monotonic version is the minimal model that demonstrates staleness.

Tradeoffs worth flagging

Range vs hash vs consistent. Range enables scans but hotspots; hash spreads but reshards catastrophically; consistent hashing is elastic but loses range scans. Choose by access pattern + elasticity.
Consistency vs latency/availability. R+W>N costs a quorum round-trip (latency, PACELC) and fails under partition (availability, CAP). Tune per dataset; don't default to strong everywhere.
Quorums ≠ convergence. Quorums make a read fresh; stale replicas still need read-repair/anti-entropy/hinted-handoff to converge.
Vnode count is a knob. Too few = imbalance; too many = metadata + lookup cost. Sweep to find the knee (exercise).

What production adds beyond this lab

Replication placement (preference lists), read-repair, anti-entropy (Merkle trees), hinted handoff (the Dynamo toolkit).
Conflict resolution (vector clocks/LWW/CRDTs) for concurrent writes.
Bounded staleness / session guarantees; coordination with consensus (db-17) for the strong-consistency datasets.

pa-06 — Execution

Prerequisites

Go ≥ 1.25 (stdlib only, offline).

One-shot

cd pa-06-partitioning-consistency && bash scripts/verify.sh

Per-language workflow (Go)

cd pa-06-partitioning-consistency/src/go
go test -race -count=1 ./...      # ring movement/balance, quorum overlap
go run ./cmd/partsim

Package map

File	What
`partition/ring.go`	consistent-hashing ring (vnodes) + naive mod-N for contrast
`partition/quorum.go`	N/W/R quorum model + the R+W>N overlap guarantee
`cmd/partsim`	key-movement comparison + quorum overlap table

See GUIDE.md for the deep dive and the CAP/PACELC framing.

pa-06 — Verification

One command

cd pa-06-partitioning-consistency && bash scripts/verify.sh

What the tests prove

Test	Invariant
`TestConsistentHashingMinimalMovement`	removing 1 of 3 nodes moves ~1/3 of keys (<50%)
`TestModHashMovesEverything`	mod-N moves most keys when N changes (the reshuffle)
`TestConsistentHashingBalance`	with 200 vnodes, each node is within ~25% of ideal load
`TestQuorumStrongWhenOverlap`	R+W>N → a worst-case read sees the latest write
`TestQuorumStaleWhenNoOverlap`	R+W≤N → a read may be stale

All under -race.

What "green" does NOT guarantee

No real replication/convergence. Read-repair, anti-entropy, hinted handoff, and conflict resolution are exercises / the Dynamo toolkit.
Versions, not values. Consistency is modeled by recency; concurrent-write conflict resolution (vector clocks/CRDTs) is out of scope.
Consensus is elsewhere. Strong consistency via a replicated log is db-16…20.

pa-07 — Infrastructure as Code (Terraform / Pulumi model)

The Apple JD wants familiarity with "infrastructure-as-code tools, such as Terraform or Pulumi." Under both sits one engine: take a declarative description of desired infrastructure, diff it against the current state, and converge the real world to match — in dependency order, idempotently, with drift detection. Build that engine once and Terraform/Pulumi stop being magic.

You build it: resources with a dependency DAG, plan (the diff), apply (topological convergence), persistent state, idempotent re-apply, and drift detection.

1. What is it?

Infrastructure as Code manages infrastructure (networks, DBs, clusters, DNS) through machine-readable declarative definitions rather than manual clicks or imperative scripts. You declare what you want (the desired state); the engine figures out how to get there.

The engine's loop:

desired config  ─┐
                 ├─▶ PLAN (diff desired vs state) ─▶ create/update/delete/no-op
current STATE  ──┘                                          │
                                                            ▼
                              APPLY (in dependency order) ─▶ converge real world + update STATE
                                                            │
                                              DRIFT: state vs the actual world (changed out of band)

Declarative, not imperative. You say "a VPC and a subnet in it exist," not "if no VPC, create one, then…". The engine computes the delta.
Dependency graph. Resources declare depends_on; the engine builds a DAG and applies in topological order (the VPC before the subnet before the DB).
State. The last-applied snapshot, so the next plan knows what to diff against (Terraform's state file — and a notorious source of pain).
Idempotent. Applying the same config twice changes nothing.
Drift. The world can change outside the tool (a manual edit); drift detection finds where reality diverged from state.

2. Why does it matter?

It's how a platform is reproducible and reviewable. Infra-as-code means environments are versioned in git, peer-reviewed (a testing strategy, pa-10), diffable before apply (plan), and rebuildable from scratch. "Click-ops" is the opposite: unreproducible, unauditable, drift-prone.
Declarative + reconcile is the dominant control paradigm. This exact loop — desired vs actual, converge — is Kubernetes (gw-10), GitOps (pa-08), and xDS (gw-08). An architect who sees that all four are the same idea wields it everywhere. (db-17's "drive replicas to a desired state" is the same instinct.)
The dependency DAG and ordering are where correctness lives. Create a subnet before its VPC and apply fails; delete a VPC before its subnet and you orphan or error. Topological order (and reverse for deletes) is the non-obvious engine detail; cycle detection prevents impossible configs.
State and drift are the operational hazards. A stale or corrupt state file, or undetected drift, causes the engine to plan the wrong delta (recreate live resources, or miss out-of-band changes). Knowing these failure modes is the difference between using Terraform and operating it.

3. How does it work?

Plan (the diff)

Plan(desired, state) validates dependencies exist, checks for cycles (topoSort), then for each desired resource: Create (not in state), Update (in state but attributes differ — with a diff), or NoOp (unchanged). Resources in state but not desired → Delete. Plan is read-only — you review it before committing. That preview is a core safety feature: you see "this will destroy the prod DB" before it happens.
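The diff step can be sketched as follows. This is a simplified illustration, not the lab's engine.go: the Resource, Change, and Action types and the diffAttrs helper are assumed shapes, and dependency validation and cycle checks are left out so the example shows only the create/update/delete/no-op classification.

```go
package main

import (
	"fmt"
	"sort"
)

// Resource is a hypothetical declared resource: a name plus flat attributes.
type Resource struct {
	Name  string
	Attrs map[string]string
}

type Action string

const (
	Create Action = "create"
	Update Action = "update"
	Delete Action = "delete"
	NoOp   Action = "no-op"
)

// Change is one planned step; Diff is filled only for updates.
type Change struct {
	Action Action
	Name   string
	Diff   []string
}

// diffAttrs lists attribute changes in sorted key order.
func diffAttrs(old, new map[string]string) []string {
	seen := map[string]bool{}
	var keys []string
	for _, m := range []map[string]string{old, new} {
		for k := range m {
			if !seen[k] {
				seen[k] = true
				keys = append(keys, k)
			}
		}
	}
	sort.Strings(keys)
	var out []string
	for _, k := range keys {
		ov, inOld := old[k]
		nv, inNew := new[k]
		if inOld != inNew || ov != nv {
			out = append(out, fmt.Sprintf("%s: %q -> %q", k, ov, nv))
		}
	}
	return out
}

// Plan diffs desired against state without touching either (read-only).
func Plan(desired []Resource, state map[string]Resource) []Change {
	var changes []Change
	wanted := map[string]bool{}
	for _, r := range desired {
		wanted[r.Name] = true
		cur, exists := state[r.Name]
		if !exists {
			changes = append(changes, Change{Action: Create, Name: r.Name})
		} else if d := diffAttrs(cur.Attrs, r.Attrs); len(d) > 0 {
			changes = append(changes, Change{Action: Update, Name: r.Name, Diff: d})
		} else {
			changes = append(changes, Change{Action: NoOp, Name: r.Name})
		}
	}
	for name := range state {
		if !wanted[name] {
			changes = append(changes, Change{Action: Delete, Name: name})
		}
	}
	// Sort by name so the plan output is reproducible.
	sort.Slice(changes, func(i, j int) bool { return changes[i].Name < changes[j].Name })
	return changes
}

func main() {
	state := map[string]Resource{
		"db":  {Name: "db", Attrs: map[string]string{"size": "small"}},
		"old": {Name: "old"},
	}
	desired := []Resource{
		{Name: "vpc"},
		{Name: "db", Attrs: map[string]string{"size": "large"}},
	}
	for _, c := range Plan(desired, state) {
		fmt.Println(c.Action, c.Name, c.Diff)
	}
}
```

Running Plan against a state that already equals desired yields only no-ops, which is the idempotency property in miniature.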

Apply (converge, in order)

Apply runs the plan, executing creates/updates in dependency order (deps first) and deletes in reverse order, then updates state. The topological sort (Kahn's algorithm, deterministic) is what guarantees the VPC exists before the subnet that needs it.

Idempotency

Applying the same config twice yields all NoOps — because the second plan diffs desired against a state that already equals it. Idempotency is what makes IaC safe to run repeatedly (in CI, on a schedule) and is the same property as the reconcile loops in gw-08/gw-10/pa-08.

State and drift

State is the last-applied snapshot. Drift = the actual world (live) differs from state: someone hand-edited a resource, or an unmanaged resource appeared. Drift(state, live) reports both. The next apply would "correct" managed drift back to the declared config — which is exactly what GitOps self-heal (pa-08) automates continuously.

Modules, providers, and the real complexity

Real IaC adds providers (plugins that talk to AWS/GCP/k8s), modules (reusable resource bundles), variables/outputs, and remote state with locking (so two engineers don't apply concurrently and corrupt state). The engine here is the kernel those wrap.

4. Core terminology

Term	Definition
Declarative	Describe desired state; the engine computes the steps.
Plan	The read-only diff (create/update/delete/no-op) before applying.
Apply	Converge the real world to desired, in dependency order.
State	The last-applied snapshot the next plan diffs against.
Dependency DAG	Resource graph; topo order = apply order (reverse for deletes).
Idempotent	Re-applying the same config changes nothing.
Drift	The actual world diverging from state (out-of-band change).
Provider	A plugin that creates/reads/updates a real resource type.
Module	A reusable, parameterized bundle of resources.
State lock	A mutex preventing concurrent applies from corrupting state.

5. Mental models

IaC is git for infrastructure. You commit a desired state; plan is git diff; apply is the merge that makes reality match. State is the index that tells you what's already committed. Drift is "someone edited the deployed files directly instead of through git."
Declarative is a thermostat; imperative is flipping the heater. You set the target (68°F / "these resources exist") and the engine drives toward it from wherever it is. An imperative script ("turn the heater on") breaks if the room's already warm or the heater's already on; the declarative engine just no-ops.
The dependency DAG is a recipe's ordering. You can't ice the cake before baking it. Topological sort is reading the recipe to find a valid order; a cycle ("A needs B, B needs A") is a recipe that can't be cooked.
State is load-bearing and fragile. It's the engine's memory of the world. Lose it and the engine thinks nothing exists (and tries to recreate everything); corrupt it and it plans nonsense. Remote state + locking exist because this single file is the system's truth.

6. Common misconceptions

"IaC is just scripts that create infra." Scripts are imperative and not idempotent (re-running breaks or duplicates). IaC is declarative with a diff-and-converge engine — that's what makes it safe to re-run and reviewable via plan.
"Apply order doesn't matter." It's everything: dependencies dictate create order and the reverse for deletes. Get it wrong and you create orphans or fail. The DAG + topo sort is the engine's core.
"State is just a cache." It's the source of truth for what the tool manages. A wrong state file makes the engine destroy or recreate live infrastructure. Treat it like a database: remote, locked, backed up.
"Plan == apply." Plan is read-only and is your safety review; apply mutates the world. Always read the plan (especially the destroys). In CI, gate apply on a reviewed plan.
"Drift doesn't happen if everyone uses the tool." Someone always hand-fixes prod at 3am. Drift detection (and GitOps self-heal, pa-08) is what catches and reverts it — or at least surfaces it.

7. Interview talking points

"How does Terraform/Pulumi work under the hood?" Declarative resources + a dependency DAG → a plan (diff desired vs state) → apply in topological order → updated state; idempotent re-apply; drift detection vs the live world. Name state-file fragility and locking. This is the engine, and you can say you've built it.
"Why declarative over imperative scripts?" Idempotency (safe to re-run), a reviewable diff (plan) before changes, reproducibility from git, and convergence from any starting state. Scripts offer none of these.
"Where does apply ordering come from?" The dependency DAG, topologically sorted (deps first for create/update, reverse for delete). Cycles are rejected. This is the same ordering concern as ADS in xDS (gw-08) and pa-01's build order.
"What is state and why is it dangerous?" The last-applied snapshot the engine diffs against. Lost/stale/corrupt state → the engine plans the wrong delta (recreates live resources, misses changes). Mitigate with remote state + locking + backups + import for adopting existing resources.
"How do you handle drift?" Detect it (compare state to the live world) and either re-apply to converge or alert. GitOps (pa-08) automates this as continuous self-heal. The same desired-vs-actual loop as Kubernetes (gw-10).
"Terraform vs Pulumi vs Crossplane?" Terraform/Pulumi: external engine + state file (Pulumi uses real languages instead of HCL). Crossplane: the reconcile loop inside Kubernetes (operators, gw-10). Same model; different host for the loop.

8. Connections to other labs

gw-08 / gw-10 (xDS / operators) — the same declarative desired-vs-actual reconcile loop, hosted in a control plane / k8s instead of an external CLI.
pa-08 (GitOps) — git is the desired state; a reconciler runs this plan/apply continuously and self-heals drift.
pa-01 (decomposition) — the dependency DAG + topological order + cycle detection is the same machinery as the service graph.
db-17 (Raft) — "drive the system to a replicated desired state" is the consensus instinct behind every reconcile loop.

pa-07 — The Hitchhiker's Guide to Infrastructure as Code

Companion to CONCEPTS.md, with the runnable engine in src/go/iac/. Build the Terraform/Pulumi kernel and the whole declarative-infra world demystifies.

bash scripts/verify.sh runs the lifecycle:

plan (empty state):           create vpc, create subnet, create db
apply (dependency order):     create vpc -> create subnet -> create db
re-apply identical config:    create=0 update=0 delete=0   <- idempotent no-op
change db size small->large:  update db [size: "small" -> "large"]
drift (out-of-band resize):   drift on: [db]   <- next apply would revert

That's the engine behind every IaC tool, end to end.

1. Plan = the diff (engine.go)

Plan(desired, state) is the read-only heart: validate deps exist, reject cycles, then per resource emit Create (TestPlanCreate), Update with a diff (TestPlanUpdate), Delete for resources dropped from desired (TestPlanDelete), or NoOp. Plan never touches the world — that preview is the safety feature ("this will destroy the prod DB" before it does). In CI you gate apply on a reviewed plan.

2. Apply = converge, in dependency order (engine.go)

Apply runs the plan, executing creates/updates in topological order (deps first) and deletes in reverse, then writes the new state. TestApplyTopoOrder proves vpc → subnet → db even when declared out of order — because topoSort (Kahn's algorithm, deterministic) orders by DependsOn. TestCycleError and TestMissingDependencyError show the two ways a config is rejected. This ordering is the same correctness concern as ADS in xDS (gw-08) and build order in pa-01.

3. Idempotency = safe to re-run (engine.go)

TestIdempotentReapply: apply a config, then apply it again → zero creates/updates/deletes. Because the second plan diffs desired against a state that already equals it, everything is a NoOp. Idempotency is what lets you run IaC in CI, on a cron, or after a crash without fear — the exact property shared by the reconcile loops in gw-08/gw-10/pa-08. It's also the difference between IaC and an imperative script (which breaks or duplicates on re-run).

4. State and drift (engine.go)

State is the last-applied snapshot the next plan diffs against — the engine's memory of the world, and (in real tools) its most fragile, load-bearing artifact. Drift(state, live) compares state to the actual world: TestDriftDetection shows a hand-edited DB (size: manually-resized) and an unmanaged "rogue" resource both reported. Managed drift is what the next apply reverts — and what GitOps self-heal (pa-08) does continuously and automatically. Lose or corrupt state and the engine plans the wrong delta (recreating live infra), which is why production uses remote state + locking + backups.

5. The architect's view

The code is the kernel; the leverage is what it enables:

Reproducible, reviewable platforms — environments in git, diffed before apply, rebuildable from scratch (vs unauditable click-ops).
The universal reconcile loop — desired vs actual, converge. IaC (here), Kubernetes (gw-10), GitOps (pa-08), and xDS (gw-08) are the same idea; recognizing that is the synthesis an architect brings.
The operational hazards to design around — state locking, blast-radius of a bad apply (plan review + targeted applies), drift, and adopting existing resources (import).

6. Hands-on

cd src/go
bash ../scripts/verify.sh
go run ./cmd/iacsim

7. Exercises

Targeted destroy ordering: build a chain (vpc → subnet → db), remove all from desired, and verify deletes run in reverse dependency order (db → subnet → vpc).
State locking: add a lock so two concurrent Applys can't corrupt state; prove with -race.
import: add adoption of an existing live resource into state without recreating it (the real-world onboarding problem).
Plan output as a gate: write a CI check that fails if a plan includes a Delete of a resource tagged protected (a fitness function, pa-10).
Make it a reconcile loop: run Apply on a ticker against a mutable "live" world that drifts; you've now built pa-08's GitOps self-heal.

pa-07 — References

IaC engines

Terraform docs — resources, the dependency graph, plan/apply, state + locking, import, modules, providers. https://developer.hashicorp.com/terraform/docs
Pulumi docs — the same model with general-purpose languages instead of HCL; the state/engine concepts are identical.
Crossplane — the reconcile loop inside Kubernetes (operators, gw-10) instead of an external CLI.
Terraform: Up & Running (Brikman) — state, modules, gotchas.

Concepts

Kahn's algorithm / topological sort (apply ordering).
Declarative vs imperative; idempotency; convergence — the reconcile paradigm shared with Kubernetes (gw-10), GitOps (pa-08), xDS (gw-08).
Infrastructure as Code (Kief Morris) — patterns and practices.

Cross-lab links

pa-08 (GitOps runs this loop continuously + self-heal), gw-08/gw-10 (the same reconcile loop in a control plane / k8s), pa-01 (dependency DAG + topo + cycle detection), db-17 (desired-state convergence as the consensus instinct).

pa-07 — Analysis

What the engine must get right

Plan correctness: create/update(+diff)/delete/no-op vs state.
Dependency-ordered apply: topo order for create/update, reverse for delete; reject cycles and missing deps.
Idempotency: re-apply == no-op.
Drift: report where the live world diverges from state (incl. unmanaged resources).

Design decisions

Deterministic topo sort (Kahn). Sorted ready-set so plans/applies are reproducible (testable, reviewable) — the same determinism discipline as the rest of the book.
State as the diff baseline. Plan diffs desired vs last-applied, not desired vs live — matching real tools (and why drift, a separate state-vs-live check, exists).
Deletes in reverse order. Dependencies must be torn down children-first; computing this from the graph avoids orphan/ordering errors.
Attr-map equality for change detection. Minimal but real: a changed attribute yields an Update with a human-readable diff.

Tradeoffs worth flagging

State is the crown jewel and the biggest hazard. Lost/stale/corrupt state makes the engine plan destructive nonsense. Production needs remote state + locking + backups + import; this lab keeps state in-memory to expose the mechanics.
Plan/apply is not transactional. A partial apply (some resources created, then a failure) leaves a mixed world; real tools record partial state and continue/retry. Model this as an exercise.
Drift correction can be destructive. Auto-reverting drift (pa-08 self-heal) overwrites a human's emergency fix; sometimes you want alert-not-revert. An operational policy choice.
Ordering is from declared deps only. Implicit dependencies (a resource referencing another's output) must be made explicit or the graph is wrong — a real Terraform footgun.

What production adds beyond this lab

Providers (real cloud APIs), modules, variables/outputs, remote state + locking, partial-apply recovery, import, and plan as a reviewed CI artifact gating apply (GitOps, pa-08).

pa-07 — Execution

Prerequisites

Go ≥ 1.25 (stdlib only, offline).

One-shot

cd pa-07-infrastructure-as-code && bash scripts/verify.sh

Per-language workflow (Go)

cd pa-07-infrastructure-as-code/src/go
go test -race -count=1 ./...      # plan, topo apply, idempotency, drift, cycles
go run ./cmd/iacsim

Package map

File	What
`iac/engine.go`	resources + DAG; Plan (diff), Apply (topo converge), State, Drift, cycle detection
`cmd/iacsim`	plan → apply → idempotent re-apply → update diff → drift demo

See GUIDE.md for the deep dive and the universal reconcile-loop connection.

pa-07 — Verification

One command

cd pa-07-infrastructure-as-code && bash scripts/verify.sh

What the tests prove

Test	Invariant
`TestPlanCreate`	empty state + N desired → N creates
`TestApplyTopoOrder`	apply respects dependencies (vpc → subnet → db) regardless of declaration order
`TestIdempotentReapply`	re-applying identical config is all no-op
`TestPlanUpdate`	a changed attribute plans an update (with a diff)
`TestPlanDelete`	a resource removed from desired plans a delete
`TestDriftDetection`	out-of-band edits + unmanaged resources are reported
`TestCycleError`	a dependency cycle is rejected
`TestMissingDependencyError`	a dependency on an undefined resource is rejected

All under -race.

What "green" does NOT guarantee

In-memory state. Production needs remote state + locking + backups + import; state fragility is the real operational hazard.
No partial-apply recovery / providers. Real apply is non-transactional against real cloud APIs (exercise).
Drift is detected, not auto-corrected. Continuous self-heal is pa-08 (and is itself a policy choice).

pa-08 — GitOps & Progressive Delivery

The Apple JD lists "DevOps and CI/CD methodologies… such as ArgoCD, Flux, or Jenkins" and "familiar with GitOps workflows and progressive delivery practices." GitOps is a specific, powerful idea: git is the single source of truth for the desired state, and a reconciler continuously makes the running system match it — pulling, not pushing. It is pa-07's plan/apply turned into a continuous control loop with self-heal and prune.

You build the reconciler: sync (create/update), prune (delete what git removed), self-heal (revert manual drift), sync-wave ordering, drift detection, and an SLO-gated promotion.

1. What is it?

GitOps = declarative desired state in git + an agent that continuously reconciles the live system to it:

   git repo (desired state) ──pull──▶  [Reconciler]  ──converge──▶  cluster (live)
        ▲ commits = the only way to change prod        │
        │                                              ├─ SYNC: apply created/changed
   PR review = change control                          ├─ PRUNE: delete what git dropped
                                                       └─ SELF-HEAL: revert manual drift

Vs imperative push CD (Jenkins runs kubectl apply from a pipeline), GitOps pulls: the agent in the cluster watches git and converges, so the cluster can't drift from git for long (it self-heals), and the audit trail is the git history.

Progressive delivery layers safe rollout on top: canary / blue-green / SLO-gated promotion with automatic rollback (the full ladder is gw-12). GitOps makes the rollout itself declarative ("90% v1, 10% v2 in git").

2. Why does it matter?

It closes the drift loop pa-07 left open. IaC detects drift; GitOps continuously corrects it. The cluster converges back to git within one reconcile interval of any change, so a hand-edit at 3am is reverted automatically (or surfaced). That's a fundamentally more reliable operational model.
Git becomes change control + audit + rollback. Every change is a reviewed PR (a testing strategy, pa-10); every state is a commit; rollback is git revert. No "what's actually deployed?" mystery, no unaudited kubectl edits.
Pull beats push for security and scale. The cluster pulls from git; you don't hand CI cluster-admin credentials or open the cluster to the pipeline. One reconciler per cluster scales to thousands of apps.
It's the same reconcile loop, again. GitOps (here), IaC (pa-07), Kubernetes operators (gw-10), and xDS (gw-08) are all desired-vs-actual convergence. An architect who names that pattern designs control planes the same way every time — and it's db-17's "drive replicas to a desired state" instinct.

3. How does it work?

The reconcile loop

Reconcile(desired, live, prune) is the loop body, run continuously:

Sync: a desired resource missing from live → create; present but different → update.
Self-heal: that "present but different" case also catches manual drift — if someone hand-edited the cluster, live differs from git, so reconcile reverts it. Sync and self-heal are the same mechanism; the power is running it continuously.
Prune: a live resource git no longer declares → delete (if pruning is enabled — a safety toggle, since prune is destructive).

Sync waves (ordering)

Resources carry a wave; the reconciler applies low waves first (CRDs before the workloads that use them; namespaces before resources in them). This is pa-07's dependency ordering expressed as explicit phases — the same "declaration before use" concern as ADS in xDS (gw-08).

Drift detection and idempotency

Diff(desired, live, prune) reports out-of-sync resources (missing, drifted, or extra). A converged reconcile is a no-op (idempotent) — the property shared by every reconcile loop in the book, and what makes running it every few seconds safe.

Progressive delivery + rollback

PromoteOrRollback(current, candidate, healthy) is the decision in miniature: promote the candidate only if healthy (SLO-gated), else keep the current version live (instant rollback — the old version never left). The full shadow → canary → ramp ladder with automated analysis is gw-12; GitOps makes the rollout state itself declarative and revertable.
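A minimal sketch of that decision. PromoteOrRollback follows the signature named above; the Healthy gate (an error-rate threshold over a request count) is a hypothetical stand-in for whatever SLO signal the lab uses, added only to show where the boolean might come from.

```go
package main

import "fmt"

// Healthy is a hypothetical SLO gate: the candidate passes if its observed
// error rate is at or below the threshold. No traffic means no evidence,
// so it is treated as unhealthy.
func Healthy(failed, requests int, maxErrorRate float64) bool {
	if requests == 0 {
		return false
	}
	return float64(failed)/float64(requests) <= maxErrorRate
}

// PromoteOrRollback returns the version that should be live: the candidate
// when healthy, otherwise the current version, which never stopped serving.
func PromoteOrRollback(current, candidate string, healthy bool) string {
	if healthy {
		return candidate
	}
	return current
}

func main() {
	// 2 failures in 1000 requests against a 1% budget: promote.
	fmt.Println(PromoteOrRollback("v1", "v2", Healthy(2, 1000, 0.01)))
	// 50 failures in 1000 requests against a 1% budget: stay on current.
	fmt.Println(PromoteOrRollback("v1", "v2", Healthy(50, 1000, 0.01)))
}
```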

4. Core terminology

Term	Definition
GitOps	Git as the source of truth + a reconciler that converges live state to it.
Pull vs push CD	Cluster pulls from git (GitOps) vs a pipeline pushes to the cluster (Jenkins).
Sync	Apply created/changed resources from git to the cluster.
Prune	Delete cluster resources git no longer declares.
Self-heal	Continuously revert manual drift back to git's desired state.
Sync wave	An ordering phase for applying resources (low waves first).
Drift	Live state differing from git (what self-heal corrects).
Progressive delivery	Canary/blue-green/SLO-gated rollout with auto-rollback (gw-12).
Reconcile loop	Level-triggered convergence: observe desired+actual, act, repeat.

5. Mental models

GitOps is a thermostat wired to git. You set the target in git (the dial); the agent continuously drives the room (cluster) to it. Open a window (manual drift) and the thermostat works to close the gap. Push CD is "manually adjust the heater once and hope nobody touches it."
Git is the system's single source of truth, including 'undo.' What should be running is whatever's committed; deploying is git push, rolling back is git revert, and the audit log is git log. The cluster is a derived, disposable projection of git.
Prune is a chainsaw — useful, dangerous, toggle-guarded. "Delete whatever git doesn't mention" cleans up beautifully and can wipe a resource you forgot to commit. That's why prune is opt-in and sync-waves/owner-refs scope it.
Self-heal vs break-glass. Continuous self-heal is great until an on-call engineer makes an emergency manual fix — which self-heal then reverts. Mature GitOps has a break-glass (pause reconcile) for exactly that. Automation must have an off switch.

6. Common misconceptions

"GitOps is just CI/CD with git." The distinction is pull + continuous reconcile + self-heal: the cluster actively converges to git and corrects drift, rather than a pipeline pushing once and walking away. That continuous-convergence property is the whole point.
"Prune is safe to leave on everywhere." Prune deletes anything not in git; an un-committed resource or a mis-scoped app can cause data loss. Enable deliberately, scope with owner references, and review prune diffs.
"Self-heal means we never have incidents." It reverts config drift; it can also revert a human's emergency fix. You need a break-glass and to treat persistent drift as a signal (someone keeps fixing something git gets wrong).
"GitOps replaces progressive delivery." They compose: GitOps is how the desired state is delivered; progressive delivery (gw-12) is how cautiously you shift traffic to a new version with rollback.
"One giant git repo / app for everything." Blast radius: a bad commit syncs everywhere at once. Structure by app/environment, use sync waves and progressive delivery, and stage changes — the same migration discipline as gw-12.

7. Interview talking points

"What is GitOps and how is it different from a Jenkins pipeline?" Git as the single source of truth + an in-cluster reconciler that pulls and continuously converges (sync/prune/self-heal). Vs a pipeline that pushes once: GitOps self-corrects drift, doesn't need cluster creds in CI, and makes git the audit log + rollback (git revert).
"How does self-heal work and when is it dangerous?" The reconcile loop detects live ≠ git (drift) and reverts to git — same mechanism as sync, run continuously. Dangerous when it reverts an emergency manual fix; mitigate with a break-glass (pause) and by treating recurring drift as a bug in git.
"How do you order dependent resources?" Sync waves (CRDs/namespaces before workloads) — pa-07's dependency ordering as explicit phases. Same declaration-before-use concern as xDS ADS (gw-08).
"GitOps + progressive delivery together?" GitOps delivers the declared state; progressive delivery (gw-12) gates the version cutover with canary/SLO analysis and auto-rollback. The rollout itself is declared in git and revertable.
"It's the same loop as Kubernetes/Terraform/xDS — why does that matter?" Desired-vs-actual convergence is the universal control paradigm. Recognizing it means you design control planes consistently (idempotent, level-triggered, drift-correcting) and reuse the same operational playbook (db-17's instinct at the platform layer).

8. Connections to other labs

pa-07 (IaC) — GitOps is plan/apply as a continuous loop with self-heal + prune; both are the reconcile paradigm.
gw-10 / gw-08 (operators / xDS) — the same loop hosted in Kubernetes / a control plane; ArgoCD/Flux are themselves operators.
gw-12 (progressive delivery) — the full shadow→canary→ramp ladder this lab's promotion gate references.
pa-09 (SLOs) — promotion gates and rollback triggers are SLO decisions.
pa-10 — PR review of git changes is a testing strategy / consensus mechanism.
db-17 (Raft) — converging to a replicated desired state is the consensus instinct underlying GitOps.

pa-08 — The Hitchhiker's Guide to GitOps

Companion to CONCEPTS.md, with the runnable reconciler in src/go/gitops/. This is pa-07's plan/apply turned into a continuous, self-healing control loop — the ArgoCD/Flux model.

bash scripts/verify.sh runs the loop end to end:

initial sync (sync-wave order):  apply crd -> apply config -> apply workload
manual drift on 'config':        healed=[config] (reverted to "v1")
git removes 'config':            pruned=[config]; workload updated to "v2"
progressive delivery:            healthy -> v2;  unhealthy -> v1 rolledBack=true

1. The reconcile loop (reconcile.go)

Reconcile(desired, live, prune) is the loop body, meant to run continuously. It does three things, and the second is the GitOps superpower:

Sync — create missing resources, update changed ones (TestSyncCreates, TestSyncUpdatesOnGitChange).
Self-heal — the "update changed" branch also catches manual drift: TestSelfHealRevertsManualDrift hand-edits svc to HACKED and the next reconcile reverts it to git's v1. Sync and self-heal are the same mechanism; running it continuously is what makes the cluster unable to drift from git for long.
Prune — delete what git dropped (TestPruneDeletesRemoved), but only when enabled (TestNoPrunePreservesExtra) because prune is destructive.

TestReconcileIsIdempotent confirms a converged pass is a no-op — the property that makes running this every few seconds safe, shared with pa-07, gw-08, and gw-10.
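
To make "sync and self-heal are the same mechanism" concrete, here is a minimal sketch of a reconcile pass. It assumes the same name → spec map model as the tests describe; the Result type and its Changed helper are hypothetical, and the real reconcile.go may be shaped differently.

```go
package main

import "fmt"

// Result records what one reconcile pass changed (hypothetical shape).
type Result struct {
	Created, Updated, Pruned []string
}

// Changed reports whether the pass mutated live at all.
func (r Result) Changed() bool {
	return len(r.Created)+len(r.Updated)+len(r.Pruned) > 0
}

// Reconcile mutates live toward desired. The "update" branch handles both a
// git change and manual drift: in each case live simply differs from desired.
func Reconcile(desired, live map[string]string, prune bool) Result {
	var r Result
	for name, want := range desired {
		got, ok := live[name]
		switch {
		case !ok:
			live[name] = want
			r.Created = append(r.Created, name)
		case got != want:
			live[name] = want
			r.Updated = append(r.Updated, name)
		}
	}
	if prune { // destructive, so opt-in
		for name := range live {
			if _, ok := desired[name]; !ok {
				delete(live, name) // deleting during range is safe in Go
				r.Pruned = append(r.Pruned, name)
			}
		}
	}
	return r
}

func main() {
	desired := map[string]string{"crd": "v1", "config": "v1"}
	live := map[string]string{}

	fmt.Println("initial sync changed:", Reconcile(desired, live, false).Changed())
	live["config"] = "HACKED" // manual drift
	fmt.Println("self-heal changed:   ", Reconcile(desired, live, false).Changed(), "config =", live["config"])
	fmt.Println("converged changed:   ", Reconcile(desired, live, true).Changed())
}
```

The last call is the idempotence property: once live matches desired, another pass changes nothing, which is why the loop can run on a short interval.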

2. Sync waves (reconcile.go)

TestSyncWaveOrdering applies crd (wave 0) → config (wave 1) → deploy (wave 2) regardless of declaration order. This is pa-07's dependency ordering as explicit phases — CRDs before the workloads that use them, namespaces before their contents — the same declaration-before-use concern as ADS in xDS (gw-08).
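
A minimal sketch of the ordering step, assuming each resource carries an integer wave and that ties are broken by name for reproducibility. The Resource type and ApplyOrder function are hypothetical names for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Resource is a hypothetical declared resource with a sync wave.
type Resource struct {
	Name string
	Wave int
}

// ApplyOrder returns a copy sorted by ascending wave, then by name, so the
// apply order does not depend on the order resources were declared in.
func ApplyOrder(rs []Resource) []Resource {
	out := append([]Resource(nil), rs...) // copy: do not reorder the caller's slice
	sort.Slice(out, func(i, j int) bool {
		if out[i].Wave != out[j].Wave {
			return out[i].Wave < out[j].Wave
		}
		return out[i].Name < out[j].Name
	})
	return out
}

func main() {
	declared := []Resource{
		{Name: "deploy", Wave: 2},
		{Name: "crd", Wave: 0},
		{Name: "config", Wave: 1},
	}
	for _, r := range ApplyOrder(declared) {
		fmt.Printf("wave %d: apply %s\n", r.Wave, r.Name)
	}
}
```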

3. Pull, not push (the architecture decision)

The deep point isn't in the code, it's in the direction. A Jenkins pipeline pushes (kubectl apply from CI, with cluster credentials, once). GitOps pulls: an agent in the cluster watches git and converges. Consequences an architect cares about:

No drift survives — self-heal continuously corrects (vs push, where the cluster drifts freely between deploys).
No cluster creds in CI — the cluster reaches out; you don't expose it to the pipeline.
Git is change control + audit + rollback — every change is a reviewed PR (pa-10), every state a commit, rollback is git revert.
It scales — one reconciler per cluster handles thousands of apps.

4. Progressive delivery + the off switch

PromoteOrRollback(current, candidate, healthy) is the SLO-gated cutover in miniature (TestProgressiveDeliveryGate): healthy → promote; unhealthy → keep current (instant rollback, the old version never left). The full shadow → canary → ramp ladder with automated analysis is gw-12; GitOps makes the rollout state itself declarative and revertable.

The maturity detail the GUIDE insists on: automation needs a break-glass. Continuous self-heal will revert an on-call engineer's emergency manual fix. Real GitOps lets you pause reconcile for that window — and treats recurring drift as a signal that git is wrong.

5. The pattern, one more time

IaC (pa-07), GitOps (here), Kubernetes operators (gw-10), and xDS (gw-08) are the same level-triggered, idempotent, converge-to-desired-state loop. ArgoCD/Flux are literally Kubernetes operators for "the app in git." An architect who internalizes this designs every control plane the same way and reuses one operational playbook — the db-17 instinct ("drive the system to a replicated desired state") at the platform layer.

6. Hands-on

cd src/go
bash ../scripts/verify.sh
go run ./cmd/gitopssim

7. Exercises

Continuous loop: run Reconcile on a ticker against a "live" map that a goroutine randomly drifts; show convergence within one tick.
Break-glass: add a paused flag that suspends self-heal so an emergency manual change survives; log that drift exists while paused.
Owner-ref-scoped prune: only prune resources this app owns, so a mis-scoped app can't delete another's resources.
Wire SLO gating: replace the boolean healthy with a real burn-rate check (pa-09) and the canary ladder (gw-12).
Multi-env: model dev/prod as separate desired sets and a promotion that's a git merge from dev→prod (the GitOps promotion pattern).

pa-08 — References

GitOps

ArgoCD docs — application reconcile, sync waves, self-heal, prune, app-of-apps, progressive sync. https://argo-cd.readthedocs.io/
Flux docs — the GitOps toolkit (source/kustomize/helm controllers). https://fluxcd.io/
OpenGitOps principles — declarative, versioned, pulled, continuously reconciled. https://opengitops.dev/
Weaveworks — the original "GitOps" essays.

Progressive delivery

Argo Rollouts / Flagger — canary, blue-green, analysis, auto-rollback (the gw-12 ladder, Kubernetes-native).
Google SRE Workbook — canarying releases (pairs with pa-09 SLOs).

CI/CD context

Jenkins / GitHub Actions / GitLab CI — push-based CD (the contrast).
Accelerate (Forsgren et al.) — why small, frequent, reversible changes win.

Cross-lab links

pa-07 (IaC = plan/apply; GitOps = the continuous loop), gw-10/gw-08 (the same reconcile loop in k8s / a control plane), gw-12 (full progressive-delivery ladder), pa-09 (SLO-gated promotion), pa-10 (PR review as change control), db-17 (Raft: converging to a replicated desired state).

pa-08 — Analysis

What the reconciler must get right

Sync create/update from git; self-heal revert manual drift (same mechanism, run continuously).
Prune only when enabled (destructive, opt-in).
Sync-wave ordering (low waves first).
Idempotent converged passes; accurate Diff.
SLO-gated promotion with instant rollback to the live version.

Design decisions

Pull/converge model. Reconcile takes desired (git) + live (cluster) and mutates live toward desired — the in-cluster agent model, not a push pipeline. This is what gives continuous self-heal.
Self-heal == sync. A drifted resource is just "live differs from desired," handled by the same apply path. The power is in running it on a loop, not in special-casing drift.
Prune is opt-in. "Delete what git doesn't declare" is powerful and dangerous; a flag (and, in production, owner-reference scoping) guards it.
Deterministic wave+name ordering. Reproducible apply order for tests and for declaration-before-use (CRDs first).

Tradeoffs worth flagging

Self-heal vs break-glass. Continuous reversion fights emergency manual fixes; production needs a pause switch and should treat recurring drift as a git bug.
Prune blast radius. A mis-scoped app or uncommitted resource can be deleted; scope with owner refs, review prune diffs.
One-repo blast radius. A bad commit can sync everywhere; structure by app/env + progressive delivery + staged rollout (gw-12).
Pull adds reconcile latency. Convergence is eventual (poll/notify interval); fine for config, but not a substitute for a real rollout gate on risky changes.

What production adds beyond this lab

Real source/kustomize/helm controllers, health assessment, app-of-apps, RBAC, drift notifications, and a pause/break-glass.
Progressive delivery (Argo Rollouts/Flagger) with SLO analysis (pa-09) and the full ladder (gw-12).
Secrets management and multi-cluster/multi-env promotion via git merges.

pa-08 — Execution

Prerequisites

Go ≥ 1.25 (stdlib only, offline).

One-shot

cd pa-08-gitops-progressive-delivery && bash scripts/verify.sh

Per-language workflow (Go)

cd pa-08-gitops-progressive-delivery/src/go
go test -race -count=1 ./...      # sync, self-heal, prune, sync waves, promotion
go run ./cmd/gitopssim

Package map

File	What
`gitops/reconcile.go`	continuous reconcile (sync/prune/self-heal), sync waves, Diff, SLO-gated promotion
`cmd/gitopssim`	initial sync, self-heal of drift, prune, git update, promotion demo

See GUIDE.md for the deep dive and the pull-vs-push architecture decision.

pa-08 — Verification

One command

cd pa-08-gitops-progressive-delivery && bash scripts/verify.sh

What the tests prove

Test	Invariant
`TestSyncCreates`	git resources are created in the cluster
`TestSyncUpdatesOnGitChange`	a git spec change updates live
`TestSelfHealRevertsManualDrift`	a manual cluster edit is reverted to git
`TestPruneDeletesRemoved`	a resource removed from git is pruned (when enabled)
`TestNoPrunePreservesExtra`	prune disabled keeps extra live resources
`TestSyncWaveOrdering`	resources apply in ascending sync-wave order
`TestReconcileIsIdempotent`	a converged reconcile is a no-op; Diff is empty
`TestProgressiveDeliveryGate`	healthy → promote; unhealthy → rollback to current

All under -race.

What "green" does NOT guarantee

No real git/cluster controllers. Production = ArgoCD/Flux with health assessment, RBAC, app-of-apps, break-glass.
Promotion is a boolean gate. The full shadow→canary→ramp ladder with SLO analysis is gw-12 + pa-09.
Self-heal has no off switch here. Production needs break-glass; auto-reverting an emergency fix is a real hazard (GUIDE §4).

pa-09 — Reliability Engineering: SLOs, Error Budgets & Bulkheads

The Apple JD asks for knowledge of "observability and reliability engineering, including SLOs, distributed tracing, and circuit breakers." Tracing internals you built in gw-11; circuit breakers and adaptive concurrency in gw-06. This lab adds the reliability-engineering discipline an architect uses to govern the platform: SLOs and error budgets, multi-window burn-rate alerting (page on a real, sustained problem — never on a blip), and bulkheads (concurrency isolation).

You build the SLO + burn-rate engine and a bulkhead, and prove the alerting fires correctly.

1. What is it?

SLI (indicator) — a measured quality signal (success ratio, p99 latency).
SLO (objective) — a target for an SLI over a window (99.9% of requests succeed over 28 days).
Error budget — 1 − SLO: the allowed unreliability. It's a budget you spend on risk (shipping, migrations) and that pauses feature work when exhausted. Reliability becomes arithmetic, not argument.
Burn rate — how fast you're consuming the budget: observedErrorRate / errorBudget. Burn rate 1 = on track to exactly exhaust the budget over the window; >1 = too fast.
Multi-window alerting — page only when a long window (it's sustained) and a short window (it's still happening) both show a high burn rate. This is the Google-SRE trick that stops a transient spike from paging you and stops a stale alert from firing after recovery.
Bulkhead — a fixed concurrency budget per dependency, so a saturated/slow dependency can only exhaust its own slots, not the whole service's.

2. Why does it matter?

SLOs turn "how reliable?" into a number and a budget. Without them, reliability is a vibe and an argument between dev (ship!) and ops (stop!). With them, a healthy budget licenses risky changes (migrations, gw-12) and a burning budget mandates stabilization — objectively. An architect sets the SLO framework that governs release velocity across teams.
Alerting quality is a system property you design. Page on every error and you train people to ignore pages (alert fatigue → missed real incidents). Page only on user-visible SLO burn, and rarely. The multi-window/multi-burn-rate scheme is the state of the art, and it's a design decision, not a dashboard afterthought.
Bulkheads (+ circuit breakers, gw-06) bound blast radius. The pa-01 blast-radius analysis tells you what a failure can reach; bulkheads, breakers, and timeouts are how you contain it so one sick dependency doesn't take the whole service down (cascading failure).
Reliability is a first-class architecture concern, not an afterthought. The "-ilities" (availability, latency) trade off against each other and against velocity (CAP/PACELC, pa-06). An architect makes those trade-offs explicit, per service, via SLOs.

3. How does it work?

Error budget and burn rate

SLO{Target: 0.999} → ErrorBudget() = 0.001. BurnRate(total, bad) = errorRate / errorBudget. To take round numbers, with a 99% target a 1% error rate against the 1% budget is burn rate 1 (sustainable: it exhausts the budget exactly at the end of the window); a 10% error rate is burn rate 10 (exhausts the budget in 1/10 of the window). The famous thresholds from the Google SRE Workbook: ~14.4× burns 2% of a 30-day budget in 1 hour → page; ~6× burns 5% of it in 6 hours → page or ticket.

Multi-window, multi-burn-rate alerting

PAGE   if long-window burn >= pageThreshold  AND short-window burn >= pageThreshold
TICKET if long-window burn >= ticketThreshold AND short-window burn >= ticketThreshold
else OK

SLO.Alert(long, short, pageThreshold, ticketThreshold) implements it. The long window proves the burn is sustained (not a blip); the short window proves it's still happening (so the alert clears quickly after recovery). Both must be hot to page — that conjunction is the whole trick. (Production uses several window pairs at different thresholds; the principle is identical.)
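
A minimal sketch of that decision, written as a free function rather than a method to keep it self-contained. The Severity type and its constants are assumptions; the comparisons follow the >= in the rule above, and the lab's code may treat a burn rate exactly equal to a threshold differently.

```go
package main

import "fmt"

// Severity is a hypothetical result type for the alert decision.
type Severity string

const (
	OK     Severity = "OK"
	Ticket Severity = "TICKET"
	Page   Severity = "PAGE"
)

// Alert takes the burn rates measured over a long and a short window.
// Both windows must be at or above a threshold for that severity to fire.
func Alert(long, short, pageThreshold, ticketThreshold float64) Severity {
	switch {
	case long >= pageThreshold && short >= pageThreshold:
		return Page // sustained and still happening
	case long >= ticketThreshold && short >= ticketThreshold:
		return Ticket // slower burn, but also sustained and ongoing
	default:
		return OK
	}
}

func main() {
	cases := []struct {
		name        string
		long, short float64
	}{
		{"sustained outage", 15, 15},
		{"transient blip", 0.5, 20},
		{"slow burn", 2, 2},
		{"healthy", 0.5, 0.5},
	}
	for _, c := range cases {
		fmt.Printf("%-17s long=%4.1fx short=%4.1fx -> %s\n",
			c.name, c.long, c.short, Alert(c.long, c.short, 14, 1))
	}
}
```

The blip case is the one to study: the short window is far above the page threshold, but the long window is cool, so nobody is paged.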

Bulkheads

Bulkhead{max} is a counting semaphore: TryAcquire takes a slot or rejects (fail fast, no queueing); Release returns it. Give each dependency its own bulkhead and a saturated dependency exhausts only its slots — the rest of the service keeps serving. Combine with circuit breakers (open when a dependency is failing) and timeouts (bound each call) and adaptive concurrency (gw-06) for the full stability toolkit (Nygard's Release It!).
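
One common way to build such a semaphore in Go is a buffered channel. The sketch below assumes that approach; the NewBulkhead constructor and the internal slots field are hypothetical, and the lab's bulkhead.go may use a different mechanism.

```go
package main

import "fmt"

// Bulkhead is a fail-fast counting semaphore backed by a buffered channel.
type Bulkhead struct {
	slots chan struct{}
}

// NewBulkhead (hypothetical constructor) creates a bulkhead with max slots.
func NewBulkhead(max int) *Bulkhead {
	return &Bulkhead{slots: make(chan struct{}, max)}
}

// TryAcquire takes a slot if one is free and never blocks or queues.
func (b *Bulkhead) TryAcquire() bool {
	select {
	case b.slots <- struct{}{}:
		return true
	default:
		return false // full: reject immediately
	}
}

// Release returns a slot. A release without a matching acquire is ignored.
func (b *Bulkhead) Release() {
	select {
	case <-b.slots:
	default:
	}
}

func main() {
	depA, depB := NewBulkhead(2), NewBulkhead(2)

	// Saturate dependency A.
	depA.TryAcquire()
	depA.TryAcquire()

	fmt.Println("depA accepts more work:", depA.TryAcquire())
	fmt.Println("depB accepts more work:", depB.TryAcquire())
}
```

Channel operations are safe for concurrent use, so callers on different goroutines need no extra locking around TryAcquire and Release.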

Graceful degradation

When budget burns or a dependency is bulkheaded/broken, degrade, don't hard-fail: serve stale/cached data, drop optional features, shed low-priority traffic (gw-06). The architect decides, per feature, what "degraded but up" looks like — the Netflix "the show must go on" ethos.

4. Core terminology

Term	Definition
SLI / SLO / SLA	Indicator / internal objective / external contract (with penalties).
Error budget	`1 − SLO`; allowed unreliability; the risk currency.
Burn rate	`errorRate / errorBudget`; >1 = consuming budget too fast.
Multi-window alerting	Page only when long AND short windows both burn fast.
Alert fatigue	Too many/noisy alerts → real ones ignored.
Bulkhead	Per-dependency concurrency cap; isolates saturation.
Cascading failure	One failure exhausting shared resources, toppling others.
Graceful degradation	Reduced-but-available service under stress.
Toil / golden signals	Manual repetitive ops / latency-traffic-errors-saturation (gw-11).

5. Mental models

The error budget is a spending account for risk. Reliability you don't spend is velocity left on the table; overspend (budget burned) and you must stop shipping and stabilize. It turns the dev-vs-ops tug-of-war into arithmetic both sides accept.
Multi-window alerting is "is it raining AND still raining?" The long window asks "has it been raining a while?" (not a single drop); the short window asks "is it still raining right now?" (so you stop the alarm once it clears). You only sound the flood siren when both are true.
Bulkheads are a ship's watertight compartments. A hull breach (saturated dependency) floods one compartment, not the whole ship. Without them, water (load) from one leak spreads everywhere and you sink (cascading failure).
Page on symptoms, alert-elsewhere on causes. The pager is for "users are hurting" (SLO burn). Causes (pool exhausted, breaker open, high CPU) go to dashboards and tickets for diagnosis. Confusing the two is how you get alert fatigue.

6. Common misconceptions

"Aim for 100% / five nines everywhere." 100% is the wrong target (impossible, and it forbids any risk/velocity). Pick the SLO users actually need; the error budget is the point, not a failure. Five nines is extremely expensive — justify it per service.
"Alert on every error." That's alert fatigue: people mute the pager and miss the real incident. Alert on SLO burn rate (user impact), multi-window, rarely. Cause-metrics are for dashboards.
"More retries/timeouts = more reliable." Past a point they amplify load (retry storms, gw-06) and turn a blip into an outage. Reliability comes from budgets, breakers, bulkheads, and shedding — bounding work, not adding it.
"Reliability is an ops problem." It's an architecture decision: the -ilities trade off and are designed in (boundaries, async, bulkheads, consistency choices). SLOs make those trade-offs explicit and owned.
"A bulkhead is the same as a circuit breaker." Bulkhead = bound concurrency (isolate saturation); circuit breaker = stop calling a failing dependency (fail fast on errors). They compose; you want both (plus timeouts + adaptive concurrency, gw-06).

7. Interview talking points

"How do you set and use SLOs?" Define SLIs users feel (success ratio, p99 latency), set an SLO target (the reliability users need, not 100%), derive the error budget (1−SLO), and use the budget to govern velocity: healthy → ship/migrate; burning → freeze and stabilize. The architect provides this framework org-wide.
"How do you alert on SLOs without alert fatigue?" Multi-window, multi-burn-rate: page only when a long window (sustained) AND a short window (ongoing) both exceed a high burn rate; ticket on slower burns; page on symptoms not causes. This avoids blip-paging and stale alerts.
"How do you prevent cascading failure?" Bound the blast radius: timeouts on every call, circuit breakers (fail fast on a sick dependency, gw-06), bulkheads (per-dependency concurrency isolation), load shedding, and graceful degradation. Tie to pa-01's blast-radius analysis: contain what a failure can reach.
"Bulkhead vs circuit breaker vs adaptive concurrency?" Bulkhead caps concurrency per dependency (isolation); breaker stops calls to a failing dependency (fail fast); adaptive concurrency (gw-06) infers the right in-flight limit from latency. Layer all three.
"100% uptime — good goal?" No: impossible, and it eliminates the error budget that lets you ship and take risks. Pick the SLO users need; treat the budget as something to spend. Over-reliability is as much a failure as under-reliability (wasted velocity/cost).

8. Connections to other labs

gw-06 — circuit breakers and adaptive concurrency; bulkheads here complete the stability toolkit.
gw-11 — the SLIs (RED metrics, histograms, traces) these SLOs are computed from; burn-rate is the alerting layer over those signals.
gw-12 / pa-08 — error budgets gate progressive delivery and promotion; a burning budget freezes risky rollouts.
pa-01 — blast-radius analysis says what a failure reaches; bulkheads/breakers say how you contain it.
pa-06 — consistency/availability choices (CAP/PACELC) are SLO decisions.

pa-09 — The Hitchhiker's Guide to SLOs & Reliability

Companion to CONCEPTS.md, with the runnable engine in src/go/reliability/. The reliability-engineering discipline that governs how an architect's platform is operated — and how fast it's allowed to change.

bash scripts/verify.sh runs the demo, and the alerting table is the whole lesson:

multi-window burn-rate alerting (page=14x, ticket=1x):
  sustained outage   longBurn=15.0x shortBurn=15.0x -> PAGE
  transient blip     longBurn= 1.0x shortBurn=20.0x -> OK      <- doesn't page!
  slow burn          longBurn= 2.0x shortBurn= 2.0x -> TICKET
  healthy            longBurn= 0.5x shortBurn= 0.5x -> OK
bulkhead isolation: depA saturated; depB unaffected

1. Error budgets and burn rate (slo.go)

SLO{Target} → ErrorBudget = 1 − Target. BurnRate(total, bad) = errorRate / errorBudget: TestBurnRate shows a 1% error rate against a 1% budget is burn rate 1 (sustainable: it exhausts the budget exactly at the end of the window), and 10% is 10× (exhausts the budget in 1/10 of the time). That single number converts "are we okay?" into arithmetic. The architect-level use: a healthy budget licenses risk (ship features, run migrations — gw-12); a burning budget mandates stabilization. SLOs end the dev-vs-ops argument by making it math.

Float note: the tests compare burn rates with a tolerance because 1 − 0.99 isn't exact in floating point — a small but real reminder never to compare floats with ==.
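
A small demonstration of why the tolerance is needed. ApproxEqual is a hypothetical helper name and the epsilon of 1e-9 is an arbitrary choice for this sketch, not necessarily what the lab's tests use.

```go
package main

import (
	"fmt"
	"math"
)

// ApproxEqual (hypothetical helper) reports whether a and b differ by at most eps.
func ApproxEqual(a, b, eps float64) bool {
	return math.Abs(a-b) <= eps
}

func main() {
	// Use a variable for the target: Go evaluates untyped constant expressions
	// such as 1 - 0.99 exactly at compile time, which would hide the rounding
	// error that float64 arithmetic produces at run time.
	target := 0.99
	budget := 1 - target

	fmt.Println("budget == 0.01:          ", budget == 0.01)
	fmt.Println("ApproxEqual(budget, 0.01):", ApproxEqual(budget, 0.01, 1e-9))
	fmt.Printf("actual budget: %.18f\n", budget)
}
```

The exact comparison fails and the tolerant one succeeds, which is the behavior the tests guard against.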

2. Multi-window alerting — the anti-fatigue trick (slo.go)

Alert(long, short, pageThreshold, ticketThreshold) pages only when both a long window (sustained, not a blip) and a short window (still happening, not already resolved) exceed the page threshold. TestMultiWindowDoesNotPageOnBlip is the key test: a short window at 20× burn but a long window at 1× → no page. That conjunction is the Google-SRE state of the art, and it's the difference between a pager people trust and one they mute. TestMultiWindowTicketsOnSlowBurn shows a slow sustained burn opening a ticket instead.

The design rule that falls out: page on symptoms (SLO burn = users hurting), not causes (CPU, pool exhaustion). Causes go to dashboards for diagnosis; only user impact wakes someone up.

3. Bulkheads (bulkhead.go)

Bulkhead{max} is a counting semaphore: TryAcquire takes a slot or rejects (fail fast, no queue). TestBulkheadsAreIndependent is the point: dependency A is saturated, yet B acquires fine — A's failure can't starve the whole service. This is how you contain the blast radius pa-01 told you to measure. Layer it with circuit breakers (stop calling a failing dependency) and adaptive concurrency (infer the right limit) from gw-06, plus timeouts, for Nygard's full stability toolkit.

4. Where this sits in the architect's toolkit

The SLIs come from gw-11 (RED metrics, histograms, traces); this is the alerting + governance layer on top.
The containment primitives (breakers, adaptive concurrency) are gw-06; bulkheads here complete them.
The error budget gates progressive delivery (gw-12) and GitOps promotion (pa-08) — a burning budget freezes risky rollouts.
Reliability is an architecture decision: the -ilities trade off (CAP/PACELC, pa-06) and are designed in, per service, with SLOs making the choice explicit and owned.

5. Hands-on

cd src/go
bash ../scripts/verify.sh
go run ./cmd/relsim

6. Exercises

Real burn-rate windows: feed a time series of (total, bad) per minute and compute rolling 5m/1h/6h burn rates; reproduce the SRE multi-burn-rate alert matrix.
Budget-gated deploys: block a (pa-08/gw-12) promotion when the remaining error budget is below a threshold.
Bulkhead + breaker + timeout: wrap a flaky dependency call in all three (reuse gw-06's breaker/limiter) and show graceful degradation under load.
Priority shedding: when a bulkhead is full, reject low-priority work first (criticality tiers) so the core path survives.
SLO dashboard: compute remaining budget over a 28-day window and project the exhaustion date from the current burn rate.

pa-09 — References

SLOs & alerting

Google SRE Book / Workbook — SLIs/SLOs/error budgets; Alerting on SLOs (multi-window, multi-burn-rate). The canon. https://sre.google/workbook/alerting-on-slos/
Implementing Service Level Objectives (Alex Hidalgo) — the practical SLO playbook.
Error-budget policy examples (how budget exhaustion changes behavior).

Stability patterns

Michael Nygard, Release It! — circuit breakers, bulkheads, timeouts, steady state, fail fast (pairs with gw-06).
Marc Brooker — retries, jitter, metastable failures.
Netflix Hystrix (archived) / resilience4j — bulkhead + breaker impls.

Observability (the SLIs)

gw-11 (RED/USE, histograms, tracing) — what SLOs are computed from.
Brendan Gregg — USE method. Google SRE Book — the four golden signals.

Cross-lab links

gw-06 (breakers/adaptive concurrency), gw-11 (the SLIs), gw-12 / pa-08 (error-budget-gated rollout), pa-01 (blast radius to contain), pa-06 (CAP/PACELC as SLO choices).

pa-09 — Analysis

What the engine must get right

Error budget = 1 − target; burn rate = errorRate / budget.
Multi-window alert: PAGE only when long AND short windows exceed the page threshold; TICKET on slower sustained burn; never page a blip.
Bulkhead isolation: a full bulkhead rejects; independent dependencies don't interfere.

Design decisions

Burn rate, not raw error rate. Normalizing by the budget makes the number comparable across services and ties directly to "time to exhaustion," which is what you alert on.
Conjunction (long AND short) to page. The long window kills blip-paging; the short window auto-resolves the alert after recovery. This is the single most important alerting design choice.
Float tolerance in tests. 1 − target isn't exact; comparisons use an epsilon — the correct way to test floating-point math.
Bulkhead = non-blocking semaphore. Reject-not-queue (fail fast) so a saturated dependency doesn't build an unbounded queue (the gw-06 lesson).

Tradeoffs worth flagging

SLO target vs cost/velocity. Each additional nine costs disproportionately more and shrinks the error budget (less room to ship). Pick what users need, not the max.
Window sizes vs detection speed. A shorter long window pages faster but risks paging on blips; a longer one is robust but slow to detect. Production uses several window pairs at different thresholds to balance the two.
Bulkhead sizing. Too small starves a healthy dependency; too large fails to isolate. Size to the dependency's healthy concurrency, like a pool (gw-04).
Symptom vs cause alerting. Page on SLO burn (user impact); route cause metrics to dashboards. Mixing them causes fatigue.

What production adds beyond this lab

Rolling time-window SLI computation from real metrics (gw-11), multiple burn-rate/window pairs, and an error-budget policy (what freezes when the budget is gone).
Bulkheads + breakers + adaptive concurrency + timeouts composed per dependency (gw-06), with criticality-aware shedding and degradation.

pa-09 — Execution

Prerequisites

Go ≥ 1.25 (stdlib only, offline).

One-shot

cd pa-09-reliability-slo && bash scripts/verify.sh

Per-language workflow (Go)

cd pa-09-reliability-slo/src/go
go test -race -count=1 ./...      # error budget, burn rate, multi-window alert, bulkheads
go run ./cmd/relsim

Package map

File	What
`reliability/slo.go`	SLO, error budget, burn rate, multi-window multi-burn-rate alerting
`reliability/bulkhead.go`	per-dependency concurrency isolation (fail-fast semaphore)
`cmd/relsim`	burn-rate alert table (blip vs sustained vs slow) + bulkhead isolation

See GUIDE.md for the deep dive and the alert-fatigue design rule.

pa-09 — Verification

One command

cd pa-09-reliability-slo && bash scripts/verify.sh

What the tests prove

Test	Invariant
`TestErrorBudget`	error budget = 1 − target (within float tolerance)
`TestBurnRate`	burn rate = errorRate / budget (1× sustainable, 10× fast, 0 with no traffic)
`TestMultiWindowPagesOnSustainedBurn`	both windows hot → PAGE
`TestMultiWindowDoesNotPageOnBlip`	short hot but long cool → NOT a page
`TestMultiWindowTicketsOnSlowBurn`	slow sustained burn → TICKET
`TestBulkheadIsolation`	a full bulkhead rejects; release frees a slot
`TestBulkheadsAreIndependent`	a saturated dependency doesn't starve another

All under -race.

What "green" does NOT guarantee

No real metrics pipeline. SLIs come from gw-11; this is the alerting + governance layer.
Single window pair. Production uses several burn-rate/window pairs + an error-budget policy.
Bulkhead only. Breakers/adaptive concurrency/timeouts (gw-06) compose with it; not all wired here.

pa-10 — Architecture in Practice: ADRs, Design Reviews & Fitness Functions

The Apple JD's through-line is the architect's job: "software architecture and systems design," "software quality methodologies, including design review, code review, and testing strategies," and "mentor engineers and build consensus across teams on cross-cutting technical decisions." This capstone is about doing architecture — the practices that turn good designs into durable, evolvable systems other engineers build on.

The runnable artifact is fitness functions: automated architecture tests (no dependency cycles, layering rules, coupling budgets) that run in CI so the design's key properties are enforced mechanically, not policed by hand. The written artifacts are the ADR, RFC, and design-review templates in steps/ — the architect's tools for capturing decisions and building consensus.

1. What is it?

Being an architect is three loops:

Decide and record. Make cross-cutting decisions and capture them as ADRs (Architecture Decision Records): context, options considered, the decision, and consequences — so the why outlives you. Bigger or contested decisions get an RFC circulated for input.
Review and align. Run design reviews that pressure-test a proposal (surfacing the failure mode the author missed), and high-quality code reviews that teach and raise the bar. The goal is consensus, reached by argument + data + prototypes, not authority.
Enforce evolution. Encode the design's invariants as fitness functions — automated tests of architectural properties — so the system stays coherent as it grows ("evolutionary architecture"). A paved road + a fitness function beats a review bottleneck.

The leverage of an architect is org-shaped: you make the right thing the easy thing for many teams (Conway's Law, Team Topologies).

2. Why does it matter?

Decisions decay without records. Six months later, nobody remembers why you chose at-least-once + idempotency over a queue, so someone "fixes" it and reintroduces the bug. An ADR is cheap insurance against re-litigating settled decisions and against losing the reasoning when people leave.
Architecture rots without enforcement. Every well-layered system drifts: a deadline-pressured PR adds a domain → infra dependency or a cycle, and reviews miss it. A fitness function in CI makes that regression fail the build — the same idea as pa-02's contract gate and pa-01's cycle check, generalized. This is how an architecture survives contact with 50 engineers.
Consensus is the real bottleneck, and it's a skill. The hard part of a cross-cutting decision (the eventing standard, the SLO framework) is aligning teams with different incentives. Architects who mandate get ignored or routed around; architects who build consensus (write the RFC, prototype, let data win, run the review well) get adoption. The JD names this explicitly.
Mentorship and reviews are leverage, not overhead. An architect who only ships code scales to one person's output; one who raises the bar via reviews and paved roads multiplies the whole org. "Enable other engineers to build better products, faster" is the role.

3. How does it work?

Fitness functions (architecture as tests)

A fitness function is an executable check of an architectural property. The engine here defines Rules — NoCycles, Layering, MaxFanOut — and Evaluate(graph, rules) runs them, returning a Report whose Passed is false if any rule is violated. Wire it as a TestArchitecture in CI and the build fails when:

a dependency cycle appears (the distributed-monolith smell, pa-01),
a layering rule is broken (domain → infra),
a coupling budget (max fan-out) is exceeded.

These are objective, automated versions of things reviews try (and fail) to catch by eye. The same pattern enforces contract compatibility (pa-02), test coverage, performance budgets, and naming conventions.
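
The engine described above can be sketched in a few dozen lines. This is a minimal illustration, not the lab's fitness.go: the names Rule, MaxFanOut, Evaluate, Report, and Passed come from the text, while the Graph and Violation types and every method signature are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// Graph maps a component to the components it depends on.
// (Assumed shape; the lab's fitness.go may model this differently.)
type Graph map[string][]string

// Violation names the rule that fired and what it found.
type Violation struct {
	Rule   string
	Detail string
}

// Rule is one executable architectural check.
type Rule interface {
	Name() string
	Check(g Graph) []Violation
}

// Report aggregates every rule's violations; Passed is the CI gate.
type Report struct {
	Passed     bool
	Violations []Violation
}

// MaxFanOut flags any component with more than Limit outgoing dependencies.
type MaxFanOut struct{ Limit int }

func (r MaxFanOut) Name() string { return "max-fan-out" }

func (r MaxFanOut) Check(g Graph) []Violation {
	nodes := make([]string, 0, len(g))
	for n := range g {
		nodes = append(nodes, n)
	}
	sort.Strings(nodes) // sorted iteration keeps output stable across runs
	var out []Violation
	for _, n := range nodes {
		if fan := len(g[n]); fan > r.Limit {
			out = append(out, Violation{
				Rule:   r.Name(),
				Detail: fmt.Sprintf("%s has fan-out %d > %d", n, fan, r.Limit),
			})
		}
	}
	return out
}

// Evaluate runs every rule; Passed is true only if no rule reported a violation.
func Evaluate(g Graph, rules []Rule) Report {
	var rep Report
	for _, r := range rules {
		rep.Violations = append(rep.Violations, r.Check(g)...)
	}
	rep.Passed = len(rep.Violations) == 0
	return rep
}

func main() {
	g := Graph{"orders": {"pg", "kafka", "auth", "billing", "search"}}
	rep := Evaluate(g, []Rule{MaxFanOut{Limit: 4}})
	fmt.Println("fitness functions passed:", rep.Passed)
	for _, v := range rep.Violations {
		fmt.Printf("[%s] %s\n", v.Rule, v.Detail)
	}
}
```

Keeping Rule as a small interface means a new constraint is one more type with a Check method, and Evaluate never changes.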

ADRs and RFCs (capturing the why)

An ADR is a short, immutable record per decision (see steps/01-adr-template.md):

# ADR-NNN: <title>
Status: proposed | accepted | superseded by ADR-MMM
Context:   the forces and constraints (the -ilities in tension)
Decision:  what we chose
Alternatives considered:  with why we rejected them
Consequences:  what becomes easier/harder; what we'll revisit

ADRs live in the repo, are versioned, and form a decision log. An RFC is the heavier, pre-decision document circulated to build consensus on a big/contested choice.

Design & code reviews (raising the bar)

A good design review uses a checklist (see steps/02-design-review-checklist.md) to pressure-test boundaries, contracts, data/consistency, failure modes, rollout, and observability before code is written. Code review is where standards propagate and engineers grow: review to teach, not just to gate. Both are software quality methodologies in the JD's sense, listed alongside testing strategies.

Building consensus (the meta-skill)

Authority doesn't scale across teams. The architect's toolkit: write it down (RFC/ADR), prototype to replace opinion with evidence, run the review so the room reaches the decision, give first-mover teams a paved road, and let the fitness function (not your nagging) enforce it afterward.

4. Core terminology

Term	Definition
ADR	Architecture Decision Record: context, decision, alternatives, consequences.
RFC	A pre-decision proposal circulated to build consensus.
Fitness function	An automated test of an architectural property (cycles, layering, coupling).
Evolutionary architecture	Architecture as a continuously-tested, changeable property (Ford et al.).
Design review	Structured pressure-test of a proposal before build.
Paved road / golden path	The supported, easy default that makes the right thing the easy thing.
Consensus	Alignment via argument/data/prototype, not authority.
Conway's Law	System structure mirrors org communication structure.
Tech radar	A curated view of adopt/trial/assess/hold technologies.
Code review	Peer review that gates quality and propagates standards.

5. Mental models

An ADR is a flight recorder for decisions. When something looks wrong later, you read the black box: what did we know, what did we weigh, what did we choose and why. Without it, every old decision is a mystery someone will "fix" into a regression.
Fitness functions are unit tests for the architecture. You don't trust developers to remember not to break the null-check; you write a test. Same with "no domain→infra dependency": don't trust reviews, write the fitness function. Green build = the architecture still holds.
An architect is a gardener, not a king. You don't command the plants to grow in rows; you set up trellises (paved roads), pull weeds (fitness functions), and prune (reviews) so the system grows the right shape on its own. Command-and-control architecture gets routed around.
Consensus is cheaper than authority. Mandating a standard creates malicious compliance and shadow workarounds; building consensus (write, prototype, review, let data win) creates adoption. The slow way is the fast way at org scale.

6. Common misconceptions

"The architect makes the decisions; teams implement them." That's the ivory-tower anti-pattern. Architects who don't build consensus get ignored; the role is influence + enablement, not command. The JD says "build consensus" for a reason.
"Document the architecture in a big wiki." Big design docs go stale the day they're written. ADRs (small, per-decision, immutable) + fitness functions (executable, always current) beat a 60-page wiki nobody reads.
"Reviews catch architecture violations." Humans miss cycles, layering breaks, and coupling creep under deadline pressure. Automate the objective checks (fitness functions); spend review time on judgment the machine can't make.
"More standards = better architecture." Standards without paved roads and enforcement are ignored; standards that make the right thing harder than the wrong thing actively backfire. Enable, then enforce mechanically — don't just publish rules.
"Architecture is up-front; then you build." Evolutionary architecture treats it as a continuously tested property — fitness functions let it change safely as requirements do, rather than ossify or rot.

7. Interview talking points

"How do you make and record architecture decisions?" ADRs (context, decision, alternatives-with-why, consequences) in the repo for the decision log; RFCs circulated before big/contested decisions to build consensus. The value is the preserved why — so decisions aren't re-litigated or accidentally reverted.
"How do you keep an architecture from rotting?" Fitness functions in CI: automated tests for no-cycles, layering, coupling budgets, contract compatibility (pa-02). The build fails on regression, so the architecture is enforced mechanically, not by hand. Cite evolutionary architecture.
"How do you drive a decision across teams you don't manage?" Influence, not authority: write the RFC, prototype to make it concrete, run the design review so the room owns the outcome, give the first teams a paved road, and let a fitness function enforce it afterward. Let data settle contested points.
"What makes a good design/code review?" Design review: a checklist pressure-testing boundaries, contracts, data/consistency, failure modes, rollback, and observability before build. Code review: teach and raise the bar, not just gate. Both are testing strategies; both are mentorship.
"How do you mentor / scale yourself?" Paved roads (the right thing is the easy default), fitness functions (enforcement without bottlenecking), and reviews-as-teaching. The architect's output is other engineers' improved output, not just their own code.
"Tell me about a wrong architecture decision." (Behavioral.) Own it, show the ADR/data that revealed it, and the migration (gw-12) that fixed it. Intellectual honesty + a recovery plan is the signal.

8. Connections to other labs

pa-01 / pa-02 — the cycle/layering and contract-compatibility checks here are the same analyses, packaged as CI fitness functions.
gw-12 — a wrong decision becomes a migration; ADRs record the call, the rollout ladder executes the change.
pa-08 (GitOps) — PR review of git changes is the change-control / consensus mechanism for the running system.
Every lab's docs/analysis.md is written as a model ADR/design-review ("tradeoffs worth flagging," "what production adds"): the decision-record habit applied throughout the book.

pa-10 — The Hitchhiker's Guide to Doing Architecture

Companion to CONCEPTS.md, with the runnable fitness-function engine in src/go/fitness/ and the ADR + design-review templates in steps/. This is the capstone: the practices that make an architect, not just a senior engineer.

bash scripts/verify.sh runs the fitness functions over a sample architecture:

=== unhealthy architecture ===
  fitness functions passed: false
    [layering]    orders (domain) -> pg (infra) violates layering
    [max-fan-out] orders has fan-out 7 > 4
    [no-cycles]   dependency cycle detected
=== healthy architecture (dependency inversion) ===
  fitness functions passed: true

Architecture, enforced by a green/red build — not by nagging in reviews.

1. Fitness functions: architecture as tests (fitness.go)

Every layered system rots: a deadline-pressured PR sneaks in a domain → infra dependency or a cycle, and the reviewer (human, tired) misses it. The fix is to make the architecture's invariants executable tests. Rules (NoCycles, Layering, MaxFanOut) each return violations; Evaluate(graph, rules) aggregates them into a Report whose Passed gates CI. TestEvaluateAggregatesAndGatesCI proves an unhealthy architecture fails; TestCleanArchitecturePasses proves a clean one (dependency-inverted: infra depends on domain, not the reverse) passes.

This is the same machinery as pa-01 (cycles, layering) and pa-02 (contract compat), packaged as a TestArchitecture you add to CI. It's the single most leveraged thing an architect can do: the design is enforced mechanically, forever, so reviews can spend their time on judgment the machine can't make. This is "evolutionary architecture" (Ford et al.) — architecture as a continuously-tested property that can change safely.

Build it into go test and the architecture can't regress without a red build. Real tools that do this: ArchUnit (JVM), import-linter (Python), go vet/depguard, dependency-cruiser (JS).
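
To make the dependency-inversion point concrete, here is a small sketch of a layering check run over an unhealthy and a healthy graph. It assumes a hypothetical Layering type with LayerOf and Forbidden fields; the lab's actual rule may be shaped differently. The component names orders and pg mirror the verify.sh output shown earlier.

```go
package main

import (
	"fmt"
	"sort"
)

// Layering is a hypothetical rule shape: LayerOf assigns each component to a
// layer, and Forbidden lists (from-layer, to-layer) pairs that must not occur.
type Layering struct {
	LayerOf   map[string]string
	Forbidden map[[2]string]bool
}

// Check returns one message per dependency edge that crosses a forbidden pair.
func (l Layering) Check(deps map[string][]string) []string {
	froms := make([]string, 0, len(deps))
	for from := range deps {
		froms = append(froms, from)
	}
	sort.Strings(froms) // sorted iteration keeps the report stable
	var out []string
	for _, from := range froms {
		tos := append([]string(nil), deps[from]...)
		sort.Strings(tos)
		for _, to := range tos {
			pair := [2]string{l.LayerOf[from], l.LayerOf[to]}
			if l.Forbidden[pair] {
				out = append(out, fmt.Sprintf("%s (%s) -> %s (%s) violates layering",
					from, pair[0], to, pair[1]))
			}
		}
	}
	return out
}

func main() {
	rule := Layering{
		LayerOf:   map[string]string{"orders": "domain", "pg": "infra"},
		Forbidden: map[[2]string]bool{{"domain", "infra"}: true},
	}
	// Unhealthy: the domain package imports the Postgres adapter directly.
	unhealthy := map[string][]string{"orders": {"pg"}}
	// Healthy (dependency inversion): orders owns the storage interface and
	// the pg adapter depends on orders in order to implement it.
	healthy := map[string][]string{"pg": {"orders"}}

	fmt.Println("unhealthy:", rule.Check(unhealthy))
	fmt.Println("healthy:  ", rule.Check(healthy))
}
```

The edge is the same pair of components in both graphs; only its direction changes, which is exactly what the rule is sensitive to.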

2. ADRs: capturing the why (steps/01)

A decision without a record is a decision someone will re-litigate or accidentally revert. An ADR captures context, the decision, alternatives considered (with why rejected), and consequences — short, immutable, in the repo. The template's most important section is "alternatives considered": what you rejected and why is the senior signal, and it's what stops the team re-deciding settled questions. Notice every lab's docs/analysis.md in this book is written as a mini-ADR ("tradeoffs worth flagging," "what production adds") — the habit applied throughout.

3. Design reviews & consensus (steps/02)

The design-review checklist pressure-tests a proposal before code: boundaries, contracts, data/consistency, failure modes, rollout, observability, evolution. It's deliberately the union of the whole book, and it doubles as the spine of a systems-design interview. The reviewer's job is to find the missed failure mode — and to teach; design review is mentorship at the architecture level.

The meta-skill the JD names is building consensus. The lab can't unit-test this, but the principle is firm: authority doesn't scale across teams you don't manage. Write the RFC, prototype to replace opinion with evidence, run the review so the room owns the decision, give the first teams a paved road, and let the fitness function (not your nagging) enforce it afterward. An architect is a gardener, not a king.

4. Why this is the capstone

Every prior lab built a thing; this lab builds the practice that makes those things cohere into a platform and survive 50 engineers and three years:

pa-01's cycle/layering analysis → a fitness function in CI.
pa-02's contract compat → a fitness function in CI.
gw-12's migrations → recorded as ADRs, executed via the rollout ladder.
pa-08's PR review → the consensus/change-control mechanism for the running system.
The whole book's analysis.md files → the ADR habit.

The architect's output isn't code; it's leverage: paved roads, automated enforcement, recorded decisions, and aligned teams that ship better, faster.

5. Hands-on

cd src/go
bash ../../scripts/verify.sh
go run ./cmd/fitsim

Then write a real ADR (steps/01) and run the design-review checklist (steps/02) against something at work.

6. Exercises

Add a fitness function to a real repo: enforce "no import cycles" (or "package X must not import Y") as a failing go test / ArchUnit/import-linter rule; watch it catch a real violation.
Naming/ownership rules: add a Rule that flags services without an owner tag, or packages violating a naming convention.
Coupling budget over time: track total edges / max fan-out per release and fail CI if coupling grows beyond a budget.
Write the ADR for a Phase 6/7 design (e.g. the eventing backbone or the SLO framework) using steps/01; include the rejected alternatives.
Run a mock design review of one of your own designs with steps/02; note which boxes were assumed rather than answered — those are your risks.

pa-10 — References

Doing architecture

Building Evolutionary Architectures (Ford, Parsons, Kua) — fitness functions; architecture as a continuously-tested property.
Michael Nygard, Documenting Architecture Decisions (the ADR origin), and the adr-tools convention. https://adr.github.io/
Software Architecture: The Hard Parts (Ford/Richards) — decision records, trade-off analysis, the architect's role.
The Software Architect Elevator (Hohpe) — connecting strategy and engineering; influence without authority.
Team Topologies (Skelton/Pais) — Conway's Law, paved roads, platform teams (why an architect's leverage is org-shaped).
Will Larson, Staff Engineer / An Elegant Puzzle — leading technical work, building consensus, RFCs.

Fitness-function tooling (real)

ArchUnit (JVM), import-linter (Python), dependency-cruiser (JS/TS), go vet / depguard / go mod graph (Go) — automated architecture rules in CI.
ThoughtWorks Tech Radar (adopt/trial/assess/hold).

Cross-lab links

pa-01 / pa-02 — the analyses packaged here as CI fitness functions.
gw-12 — decisions recorded as ADRs, executed as migrations.
pa-08 — PR review as change control/consensus.
Every lab's docs/analysis.md — the ADR/trade-off habit in practice.

pa-10 — Analysis

What the engine must get right

Rules return precise violations; the Report returned by Evaluate has Passed false iff any rule is violated (the CI gate).
Cycle detection (DFS coloring) is sound; clean DAGs pass.
Layering / fan-out rules are data-driven and deterministic.

Design decisions

Rule as an interface. New architectural constraints (naming, ownership, coverage, perf budgets) plug in without touching the engine — the same extensibility as a linter.
Report aggregates per rule. CI output names which rule failed and where, so the violation is actionable, not just "build red."
Deterministic traversal. Sorted iteration makes violations stable across runs (reviewable diffs) — the book's determinism discipline.
Templates as deliverables. The ADR + design-review checklist (steps/) are first-class artifacts: the architect's written tools, not afterthoughts.

Tradeoffs worth flagging

Fitness functions enforce the objective, not the wise. They catch cycles and layering breaks; they can't judge whether a boundary is in the right place. Use them to free reviews for judgment, not replace judgment.
Too-strict rules get disabled. A fitness function that fires on legitimate exceptions trains people to add //nolint. Allow scoped, reviewed exceptions (with an ADR) rather than blanket suppression.
Consensus can't be automated. The hardest part of the role (aligning teams) has no unit test; the templates + practices are the leverage.
ADRs rot if not maintained. Supersede (don't edit); link related ADRs; keep them in the repo so they're versioned with the code.
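
One way to allow scoped, reviewed exceptions instead of blanket suppression is sketched below. Everything here is an assumption for illustration: the Violation and Exception types, the ApplyExceptions function, and the ADR numbers are hypothetical and not part of the lab's engine. An exception waives exactly one edge for one rule, counts only if it cites an ADR, and is reported as stale once it no longer matches anything.

```go
package main

import "fmt"

// Violation is one rule firing on one dependency edge (hypothetical shape).
type Violation struct{ Rule, From, To string }

// Exception waives a single edge for a single rule and must cite the ADR
// that justified it. An exception with an empty ADR waives nothing.
type Exception struct{ Rule, From, To, ADR string }

// ApplyExceptions returns the violations that are not waived, plus the
// exceptions that matched nothing (stale or invalid, so safe to delete).
func ApplyExceptions(vs []Violation, exs []Exception) (remaining []Violation, stale []Exception) {
	used := make([]bool, len(exs))
	for _, v := range vs {
		waived := false
		for i, e := range exs {
			if e.ADR != "" && e.Rule == v.Rule && e.From == v.From && e.To == v.To {
				used[i], waived = true, true
			}
		}
		if !waived {
			remaining = append(remaining, v)
		}
	}
	for i, e := range exs {
		if !used[i] {
			stale = append(stale, e)
		}
	}
	return remaining, stale
}

func main() {
	violations := []Violation{
		{Rule: "layering", From: "orders", To: "pg"},
		{Rule: "layering", From: "billing", To: "pg"},
	}
	exceptions := []Exception{
		{Rule: "layering", From: "orders", To: "pg", ADR: "ADR-0012"}, // hypothetical ADR id
		{Rule: "layering", From: "search", To: "pg", ADR: "ADR-0013"}, // no longer matches anything
	}
	remaining, stale := ApplyExceptions(violations, exceptions)
	fmt.Println("still failing:", remaining)
	fmt.Println("stale exceptions:", stale)
}
```

Failing the build on stale exceptions as well keeps the waiver list from silently growing into the blanket suppression it was meant to avoid.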

What production adds beyond this lab

Real fitness-function tooling wired into CI (ArchUnit/import-linter/depguard) over the actual import/service graph.
An ADR log + RFC process + an architecture review forum; a tech radar.
Paved roads (templates, libraries, golden paths) so the right thing is the easy default, plus mentorship and code-review standards.

pa-10 — Execution

Prerequisites

Go ≥ 1.25 (stdlib only, offline).

One-shot

cd pa-10-architecture-in-practice && bash scripts/verify.sh

Per-language workflow (Go)

cd pa-10-architecture-in-practice/src/go
go test -race -count=1 ./...      # no-cycles, layering, fan-out, aggregate gate
go run ./cmd/fitsim

What's here

Path	What
`fitness/fitness.go`	architecture fitness functions: NoCycles, Layering, MaxFanOut + Evaluate (CI gate)
`cmd/fitsim`	runs the rules over an unhealthy vs healthy architecture
`steps/01-adr-template.md`	the ADR (decision record) template
`steps/02-design-review-checklist.md`	the design-review checklist (also a systems-design-interview spine)

See GUIDE.md for the deep dive and the "architect as gardener" framing.

pa-10 — Verification

One command

cd pa-10-architecture-in-practice && bash scripts/verify.sh

What the tests prove

Test	Invariant
`TestNoCyclesPassesAndFails`	acyclic passes; a cycle is detected
`TestLayeringRule`	a forbidden layer dependency is flagged
`TestMaxFanOut`	exceeding the coupling budget is flagged; within it passes
`TestEvaluateAggregatesAndGatesCI`	an unhealthy architecture fails the aggregate gate with named violations
`TestCleanArchitecturePasses`	a dependency-inverted, acyclic, low-coupling design passes all rules

All under -race. Wire Evaluate into a real TestArchitecture and the build fails when the architecture regresses.
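
A sketch of that wiring, under stated assumptions: the helper AssertArchitecture, the TB interface, and the simplified Report (violations as plain strings) are hypothetical names for this example, not the lab's API. The helper accepts a one-method interface that *testing.T satisfies, so it can live in ordinary code and be exercised without the test runner.

```go
package main

import "fmt"

// TB is the subset of *testing.T this helper needs; *testing.T satisfies it.
type TB interface {
	Errorf(format string, args ...any)
}

// Report is a simplified stand-in for the engine's report.
type Report struct {
	Passed     bool
	Violations []string
}

// AssertArchitecture marks the test failed once per violation so the CI log
// names every broken rule instead of a single generic failure.
func AssertArchitecture(t TB, rep Report) {
	if rep.Passed {
		return
	}
	if len(rep.Violations) == 0 {
		t.Errorf("architecture check failed without details")
		return
	}
	for _, v := range rep.Violations {
		t.Errorf("architecture violation: %s", v)
	}
}

// recorder stands in for *testing.T so the sketch runs as a plain program.
type recorder struct{ msgs []string }

func (r *recorder) Errorf(format string, args ...any) {
	r.msgs = append(r.msgs, fmt.Sprintf(format, args...))
}

// In a _test.go file the call site would look roughly like:
//
//	func TestArchitecture(t *testing.T) {
//		AssertArchitecture(t, report) // report built from your real graph
//	}
func main() {
	rec := &recorder{}
	AssertArchitecture(rec, Report{Violations: []string{"[no-cycles] dependency cycle detected"}})
	for _, m := range rec.msgs {
		fmt.Println(m)
	}
}
```

Using Errorf rather than stopping at the first problem lets one red build list every regression at once.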

What "green" does NOT guarantee

Objective rules, not wisdom. Fitness functions catch cycles/layering/coupling; they can't tell you a boundary is in the wrong place.
Consensus/mentorship aren't testable. The ADR + design-review templates (steps/) and the practices in CONCEPTS/GUIDE are the leverage.
Real enforcement is tool-specific (ArchUnit/import-linter/depguard over your actual graph).

pa-10 step 01 — The ADR (Architecture Decision Record) template

An ADR captures one architecturally-significant decision: the context, what you chose, what you rejected, and the consequences. Keep it short, immutable (supersede rather than edit), and in the repo next to the code. The value is the preserved why — so the decision isn't re-litigated or accidentally reverted six months later.

Rule of thumb: write an ADR when a decision is costly to reverse, affects multiple teams/services, or someone will ask "why did we do it this way?" Don't ADR trivial, easily-reversible choices.

Template

# ADR-0007: Use the transactional outbox for service-to-broker events

- Status: Accepted        # proposed | accepted | superseded by ADR-00NN | deprecated
- Date: 2026-06-14
- Deciders: platform-arch, payments-team, eng-leads
- Supersedes: —
- Related: ADR-0003 (eventing backbone = Kafka), pa-05, pa-03

## Context
What forces are at play? State the problem and the constraints — the
"-ilities" in tension, the scale, the existing decisions this builds on.

> Services must update their DB and publish a domain event. A best-effort
> dual write loses events on a crash (the dual-write problem), causing
> downstream inconsistency we've already been paged for. We need
> at-least-once publication tied to the DB commit, across ~40 services, in
> Go and Java, on our existing Postgres + Kafka.

## Decision
The choice, stated plainly.

> Adopt the **transactional outbox**: services write the event into an
> `outbox` table in the same DB transaction as the state change; a relay
> (CDC via Debezium where available, polling elsewhere) publishes to Kafka
> and marks rows sent. Consumers MUST be idempotent (dedup by event id).

## Alternatives considered
List the real options and *why each was rejected* — this is the part that
prevents re-litigation.

> - **Best-effort dual write** — rejected: loses events on crash (the
>   problem).
> - **Distributed 2PC (DB + Kafka)** — rejected: blocks on coordinator
>   failure, poor availability, operationally heavy.
> - **Event sourcing (log is source of truth)** — rejected for now: large
>   migration + team-readiness cost; revisit per-service.

## Consequences
What becomes easier, what becomes harder, and what you'll revisit.

> + No lost events; state and events never diverge.
> + Works on existing Postgres + Kafka; incremental per-service adoption.
> − At-least-once → every consumer needs idempotency (provide a shared lib).
> − A relay/CDC component to operate (monitoring, lag alerts).
> Revisit if: CDC lag becomes a problem, or a service needs event sourcing.

Tasks

Write an ADR for a decision in your system using this template.
Find an old decision whose reasoning is now lost — write its ADR retroactively and notice how much context was nearly gone.
Practice the "Alternatives considered" section: the senior signal is what you rejected and why, not just what you chose.
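
The template's structure can itself be checked mechanically, in the same spirit as a fitness function. The sketch below is an assumption-laden illustration, not part of the lab: MissingSections and requiredSections are hypothetical names, and it assumes ADRs are markdown files that use the four "## " headings from the template above.

```go
package main

import (
	"fmt"
	"strings"
)

// requiredSections mirrors the "## " headings of the ADR template.
var requiredSections = []string{"Context", "Decision", "Alternatives considered", "Consequences"}

// MissingSections returns the required headings that the ADR text lacks,
// in template order.
func MissingSections(adr string) []string {
	seen := map[string]bool{}
	for _, line := range strings.Split(adr, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "## ") {
			seen[strings.TrimSpace(strings.TrimPrefix(line, "## "))] = true
		}
	}
	var missing []string
	for _, s := range requiredSections {
		if !seen[s] {
			missing = append(missing, s)
		}
	}
	return missing
}

func main() {
	// A draft that records the decision but not what was rejected or why.
	draft := "# ADR-0008: Example\n\n## Context\n...\n\n## Decision\n...\n"
	fmt.Println("missing sections:", MissingSections(draft))
}
```

A check like this only proves the headings exist; whether the rejected alternatives are the real ones is still a reviewer's call.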

pa-10 step 02 — The design-review checklist

A design review pressure-tests a proposal before code is written, so the expensive mistakes (a boundary in the wrong place, a missing failure mode, no rollback path) surface when they're cheap to fix. The reviewer's job is to find the thing the author didn't think of — kindly. This checklist doubles as a structure for the systems-design interview round (see [gw-00 / pa-00 INTERVIEW.md]).

The checklist

Problem & scope

What are we optimizing — which "-ilities" (latency, throughput, consistency, cost, dev velocity)? Stated and prioritized?
Scale: RPS, data size, event rate, # teams/consumers, growth?
What's explicitly out of scope / a non-goal?

Boundaries & contracts (pa-01, pa-02)

Services split by bounded context, not technical layer or table?
Any dependency cycles / shared databases / distributed-monolith smells? (Run the fitness function, pa-01/pa-10.)
Contracts explicit and versioned? Compatibility plan? (pa-02)
Sync vs async chosen per edge, with the cost owned? (pa-03)

Data & consistency (pa-04, pa-05, pa-06)

Partitioning strategy + key (skew/hotspots)? Resharding cost?
Consistency model per dataset (the weakest that's correct)? CAP/PACELC?
Dual-write avoided (outbox)? Cross-service workflow = saga, not 2PC?
Delivery semantics (at-least-once + idempotent consumers)? Ordering?

Failure & reliability (pa-09, gw-06)

Failure modes enumerated? Blast radius (pa-01) understood?
Timeouts, retries (+budget+jitter), circuit breakers, bulkheads, backpressure, load shedding?
SLOs + error budget defined? What degrades first (graceful degradation)?
Metastable-failure / retry-storm risk considered?

Delivery & operability (pa-07, pa-08, gw-11, gw-12)

IaC + GitOps? Progressive delivery + automatic rollback?
Observability: RED/USE metrics, tracing across async hops, alerting (symptom-based, not cause-based)?
Migration plan (strangler-fig / shadow → canary → ramp)? Rollback tested?

Evolution & people

What changes are likely, and does the design absorb them (evolutionary architecture)?
Which invariants become fitness functions in CI?
Is there an ADR for the key decisions and rejected alternatives?
Security/authz (gw-07), data privacy, multi-tenancy?

How to run it well

Send the doc + this checklist ahead; review the document, not the person.
Drive toward a decision (and an ADR), not an open-ended discussion.
The reviewer adds value by finding the missed failure mode and by teaching — design review is mentorship at the architecture level.
Disagree-and-commit: record dissent, decide, move; revisit if data changes.

Tasks

Run this checklist against a recent design at work; count how many boxes were implicitly assumed vs explicitly answered.
Use it as the spine for a mock systems-design interview (pa-00 INTERVIEW.md) and notice it covers every round.