Analysis — Storage Primitives

This document is for the design decisions and the trade-offs. The CONCEPTS.md told you what exists; this tells you why we picked one over the other and what you'd reach for in different conditions.


Decision 1: pread/pwrite over read/write

We use pread/pwrite (explicit offsets) instead of read/write + lseek.

Aspectread/write + lseekpread/pwrite
Thread safety on shared fdUnsafe — file pointer is shared, lseek+read racesSafe — offset is per-call
Syscalls per op2 (lseek + read)1
Mental modelStateful cursorStateless random access
Used bySingle-threaded streaming codeAll real databases (SQLite, LevelDB, Postgres)

Verdict: pread/pwrite is strictly better for database-style access patterns. The only reason to use the cursor variant is when you genuinely have a single sequential reader (e.g., tail -f).


Decision 2: pread/pwrite over mmap

This is more nuanced. We use explicit I/O for all labs except where we deliberately study mmap.

Aspectmmappread/pwrite
Code complexityLower (pointer access)Higher (explicit calls)
Latency predictabilityBad — page faults are synchronous, can stall on cold pagesGood — every cost is visible in the syscall
Write durabilityTricky — msync(MS_SYNC) is expensive and synchronizes the whole mappingSurgical — fdatasync(fd)
Memory accountingCounts as anonymous memory; hard to reason about WSSBuffers are yours, you bound them
Large files (> RAM)Catastrophic — random page-in stormsFine — you read what you need
Multi-threaded scalingPage-fault locks scale poorlyLinear scaling with cores
TLB pressureHugepages help but transparent hugepage transitions can pause processesNone
Used byLMDB, BoltDBSQLite, LevelDB, RocksDB, Postgres

The Pavlo et al. CIDR 2022 paper (linked in references.md) is the definitive teardown. TL;DR: mmap is fine when (a) the dataset fits in RAM, (b) the workload is read-heavy, and (c) you don't care about latency tails. For everything else, pread/pwrite wins.


Decision 3: Page Size = 4 KiB

We pick 4 KiB as the default page size in this lab and reconsider in later labs.

Page sizeProsCons
512 BOld HDD sector; small writes are cheapTiny metadata-to-data ratio, lots of indirection
4 KiBMatches OS page, NVMe LBA, page cache. Sweet spot for OLTP.Small for analytics
8 KiBPostgres default. Better for slightly larger rows.Wastes I/O for tiny tuples
16 KiBMySQL InnoDB default. Good index fanout.One row update = 16 KiB write
64 KiB / 1 MiBAnalytics, sequential scans, Parquet row groupsTerrible for random updates

Rule of thumb: page size ≈ the device sector size × small constant. With NVMe at 4 KiB LBA and the OS page also at 4 KiB, going smaller is fighting the hardware and going larger is amortizing a smaller win.


Decision 4: Endianness on Disk

Our on-disk format is little-endian. Justified by:

  1. x86 and ARM (in normal mode) are little-endian. Big-endian on these platforms means a byte swap on every read.
  2. Network protocols use big-endian by convention, but our on-disk format is not a network protocol — it's only read by the same machine (or by an explicit migration tool).
  3. LevelDB and RocksDB use little-endian for fixed-width fields, with varints for variable-width. We follow that convention for compatibility of mental model.
  4. SQLite uses big-endian for historical reasons (the format dates to 2000, when MIPS/PowerPC/SPARC were still common). It's a legitimate alternative; we just optimize for the modern hardware reality.

Always use explicit conversion functions at the I/O boundary. Never memcpy an int to disk and hope. Our Rust code uses u64::to_le_bytes; Go uses encoding/binary.LittleEndian; C++20 uses std::endian + std::byteswap.


Decision 5: When to call fsync

The cost of fsync on consumer NVMe is roughly:

  • Single-threaded latency: ~50 µs–1 ms depending on outstanding writes
  • Throughput-limited: roughly 5,000–20,000 fsync/sec before contention dominates

The cost on a HDD is 5–15 ms per fsync. That's why ye olde databases did group commit.

The right policy depends on durability requirements:

PolicyWhat survives a crashThroughput cost
No fsyncNothing reliably (kernel may flush eventually)None
fsync per writeEvery acknowledged writeMassive — one syscall per write
fsync per N writesLast (N-1) writes possibly lost1/N the cost
Group commitEvery acknowledged write; latency = time-to-batch + fsyncExcellent — best of both
fsync periodically (e.g., 100 ms)Last 100 ms of writes possibly lost (MySQL innodb_flush_log_at_trx_commit=2)Excellent

The right design for Phase 2's WAL is group commit: when a writer finishes, it waits on a condition variable; the WAL writer thread pwrites pending records, fdatasyncs once, then wakes every waiter. We'll build this in db-03.

macOS Caveat

On macOS, fsync(fd) does not flush the drive's write cache — it only sends the data to the drive. To get true durability you must call fcntl(fd, F_FULLFSYNC), which can be 10–100× slower than fsync on the same hardware. SQLite, LevelDB, and Postgres all handle this. Our wrapper in src/*/fsync_full.* does the platform dispatch.


Decision 6: O_DIRECT — Not in This Lab

We don't use O_DIRECT in Lab 01 because:

  1. It requires aligned buffers (typically 4 KiB), aligned offsets, and aligned I/O sizes.
  2. It bypasses the page cache, so you must implement your own — which is a buffer manager (Phase 3, db-11).
  3. It's not available on macOS — you'd use fcntl(fd, F_NOCACHE, 1) as the closest analogue, but it has weaker semantics.

We revisit O_DIRECT in db-21-storage-engine-advanced once we have a buffer manager worth talking about.


Hardware Numbers Cheat-Sheet

Memorize these. They drive every storage design decision:

L1 cache hit            1   ns
Branch mispredict       3   ns
L2 cache hit           ~4   ns
L3 cache hit          ~15   ns
DRAM access          ~100   ns       — 100× L1
Context switch       ~1–5  µs
NVMe random read      ~50  µs        — 500× DRAM
NVMe sequential read  ~5   µs/4KB
SATA SSD random read ~100  µs
SATA SSD seq read    ~10   µs/4KB
HDD random read       ~5   ms        — 100,000× DRAM
HDD sequential read  ~30   µs/4KB    (~150 MB/s)
fsync on NVMe       ~50 µs–1 ms
fsync on HDD          ~10 ms
F_FULLFSYNC (macOS)   ~10–50 ms      — actually flushes drive cache
Network RTT same DC   ~500 µs
Network RTT same region ~1 ms
Network RTT cross-region ~50–150 ms  — drives Raft heartbeat tuning in db-17

Five-order-of-magnitude gaps are where the design changes. Between L1 and DRAM (100×), you can ignore it. Between DRAM and disk (1000×), you can't. Between disk and network cross-region (1000× again), distributed systems get hard.


What Breaks at Scale

  • Filesystem journal contention: fsync on ext4 serializes through the FS journal. Many concurrent fsyncs on the same FS don't scale linearly. Mitigation: one WAL file per database, dedicated FS for WAL.
  • Page cache thrashing: when working set > RAM, every pread is a miss. The kernel's LRU is generic; an app-aware cache (Phase 2's block cache, Phase 3's pager) does better.
  • fsync failure handling: on Linux, a failed fsync can mark dirty pages as clean — silently losing your data. This is the "fsyncgate" referenced in the references. Mitigation: panic on fsync error and crash-recover from the WAL (modern Postgres does this).
  • NVMe queue depth: NVMe shines with QD=32–128 in flight. A single-threaded pread loop runs at QD=1 and leaves most of the drive idle. io_uring (Phase 5) fixes this.