Analysis — Storage Primitives

This document is for the design decisions and the trade-offs. The CONCEPTS.md told you what exists; this tells you why we picked one over the other and what you'd reach for in different conditions.

Decision 1: `pread`/`pwrite` over `read`/`write`

We use pread/pwrite (explicit offsets) instead of read/write + lseek.

Aspect	`read`/`write` + `lseek`	`pread`/`pwrite`
Thread safety on shared `fd`	Unsafe — file pointer is shared, `lseek`+`read` races	Safe — offset is per-call
Syscalls per op	2 (lseek + read)	1
Mental model	Stateful cursor	Stateless random access
Used by	Single-threaded streaming code	All real databases (SQLite, LevelDB, Postgres)

Verdict: pread/pwrite is strictly better for database-style access patterns. The only reason to use the cursor variant is when you genuinely have a single sequential reader (e.g., tail -f).

Decision 2: `pread`/`pwrite` over `mmap`

This is more nuanced. We use explicit I/O for all labs except where we deliberately study mmap.

Aspect	`mmap`	`pread`/`pwrite`
Code complexity	Lower (pointer access)	Higher (explicit calls)
Latency predictability	Bad — page faults are synchronous, can stall on cold pages	Good — every cost is visible in the syscall
Write durability	Tricky — `msync(MS_SYNC)` is expensive and synchronizes the whole mapping	Surgical — `fdatasync(fd)`
Memory accounting	Counts as anonymous memory; hard to reason about WSS	Buffers are yours, you bound them
Large files (> RAM)	Catastrophic — random page-in storms	Fine — you read what you need
Multi-threaded scaling	Page-fault locks scale poorly	Linear scaling with cores
TLB pressure	Hugepages help but transparent hugepage transitions can pause processes	None
Used by	LMDB, BoltDB	SQLite, LevelDB, RocksDB, Postgres

The Pavlo et al. CIDR 2022 paper (linked in references.md) is the definitive teardown. TL;DR: mmap is fine when (a) the dataset fits in RAM, (b) the workload is read-heavy, and (c) you don't care about latency tails. For everything else, pread/pwrite wins.

Decision 3: Page Size = 4 KiB

We pick 4 KiB as the default page size in this lab and reconsider in later labs.

Page size	Pros	Cons
512 B	Old HDD sector; small writes are cheap	Tiny metadata-to-data ratio, lots of indirection
4 KiB	Matches OS page, NVMe LBA, page cache. Sweet spot for OLTP.	Small for analytics
8 KiB	Postgres default. Better for slightly larger rows.	Wastes I/O for tiny tuples
16 KiB	MySQL InnoDB default. Good index fanout.	One row update = 16 KiB write
64 KiB / 1 MiB	Analytics, sequential scans, Parquet row groups	Terrible for random updates

Rule of thumb: page size ≈ the device sector size × small constant. With NVMe at 4 KiB LBA and the OS page also at 4 KiB, going smaller is fighting the hardware and going larger is amortizing a smaller win.

Decision 4: Endianness on Disk

Our on-disk format is little-endian. Justified by:

x86 and ARM (in normal mode) are little-endian. Big-endian on these platforms means a byte swap on every read.
Network protocols use big-endian by convention, but our on-disk format is not a network protocol — it's only read by the same machine (or by an explicit migration tool).
LevelDB and RocksDB use little-endian for fixed-width fields, with varints for variable-width. We follow that convention for compatibility of mental model.
SQLite uses big-endian for historical reasons (the format dates to 2000, when MIPS/PowerPC/SPARC were still common). It's a legitimate alternative; we just optimize for the modern hardware reality.

Always use explicit conversion functions at the I/O boundary. Never memcpy an int to disk and hope. Our Rust code uses u64::to_le_bytes; Go uses encoding/binary.LittleEndian; C++20 uses std::endian + std::byteswap.

Decision 5: When to call `fsync`

The cost of fsync on consumer NVMe is roughly:

Single-threaded latency: ~50 µs–1 ms depending on outstanding writes
Throughput-limited: roughly 5,000–20,000 fsync/sec before contention dominates

The cost on a HDD is 5–15 ms per fsync. That's why ye olde databases did group commit.

The right policy depends on durability requirements:

Policy	What survives a crash	Throughput cost
No `fsync`	Nothing reliably (kernel may flush eventually)	None
`fsync` per write	Every acknowledged write	Massive — one syscall per write
`fsync` per N writes	Last (N-1) writes possibly lost	1/N the cost
Group commit	Every acknowledged write; latency = time-to-batch + `fsync`	Excellent — best of both
`fsync` periodically (e.g., 100 ms)	Last 100 ms of writes possibly lost (MySQL `innodb_flush_log_at_trx_commit=2`)	Excellent

The right design for Phase 2's WAL is group commit: when a writer finishes, it waits on a condition variable; the WAL writer thread pwrites pending records, fdatasyncs once, then wakes every waiter. We'll build this in db-03.

macOS Caveat

On macOS, fsync(fd) does not flush the drive's write cache — it only sends the data to the drive. To get true durability you must call fcntl(fd, F_FULLFSYNC), which can be 10–100× slower than fsync on the same hardware. SQLite, LevelDB, and Postgres all handle this. Our wrapper in src/*/fsync_full.* does the platform dispatch.

Decision 6: `O_DIRECT` — Not in This Lab

We don't use O_DIRECT in Lab 01 because:

It requires aligned buffers (typically 4 KiB), aligned offsets, and aligned I/O sizes.
It bypasses the page cache, so you must implement your own — which is a buffer manager (Phase 3, db-11).
It's not available on macOS — you'd use fcntl(fd, F_NOCACHE, 1) as the closest analogue, but it has weaker semantics.

We revisit O_DIRECT in db-21-storage-engine-advanced once we have a buffer manager worth talking about.

Hardware Numbers Cheat-Sheet

Memorize these. They drive every storage design decision:

L1 cache hit            1   ns
Branch mispredict       3   ns
L2 cache hit           ~4   ns
L3 cache hit          ~15   ns
DRAM access          ~100   ns       — 100× L1
Context switch       ~1–5  µs
NVMe random read      ~50  µs        — 500× DRAM
NVMe sequential read  ~5   µs/4KB
SATA SSD random read ~100  µs
SATA SSD seq read    ~10   µs/4KB
HDD random read       ~5   ms        — 100,000× DRAM
HDD sequential read  ~30   µs/4KB    (~150 MB/s)
fsync on NVMe       ~50 µs–1 ms
fsync on HDD          ~10 ms
F_FULLFSYNC (macOS)   ~10–50 ms      — actually flushes drive cache
Network RTT same DC   ~500 µs
Network RTT same region ~1 ms
Network RTT cross-region ~50–150 ms  — drives Raft heartbeat tuning in db-17

Five-order-of-magnitude gaps are where the design changes. Between L1 and DRAM (100×), you can ignore it. Between DRAM and disk (1000×), you can't. Between disk and network cross-region (1000× again), distributed systems get hard.

What Breaks at Scale

Filesystem journal contention: fsync on ext4 serializes through the FS journal. Many concurrent fsyncs on the same FS don't scale linearly. Mitigation: one WAL file per database, dedicated FS for WAL.
Page cache thrashing: when working set > RAM, every pread is a miss. The kernel's LRU is generic; an app-aware cache (Phase 2's block cache, Phase 3's pager) does better.
fsync failure handling: on Linux, a failed fsync can mark dirty pages as clean — silently losing your data. This is the "fsyncgate" referenced in the references. Mitigation: panic on fsync error and crash-recover from the WAL (modern Postgres does this).
NVMe queue depth: NVMe shines with QD=32–128 in flight. A single-threaded pread loop runs at QD=1 and leaves most of the drive idle. io_uring (Phase 5) fixes this.

Distributed Systems Engineer — Build Databases & Consensus From Scratch