Analysis — Storage Primitives
This document is for the design decisions and the trade-offs. The CONCEPTS.md told you what exists; this tells you why we picked one over the other and what you'd reach for in different conditions.
Decision 1: pread/pwrite over read/write
We use pread/pwrite (explicit offsets) instead of read/write + lseek.
| Aspect | read/write + lseek | pread/pwrite |
|---|---|---|
Thread safety on shared fd | Unsafe — file pointer is shared, lseek+read races | Safe — offset is per-call |
| Syscalls per op | 2 (lseek + read) | 1 |
| Mental model | Stateful cursor | Stateless random access |
| Used by | Single-threaded streaming code | All real databases (SQLite, LevelDB, Postgres) |
Verdict: pread/pwrite is strictly better for database-style access patterns. The only reason to use the cursor variant is when you genuinely have a single sequential reader (e.g., tail -f).
Decision 2: pread/pwrite over mmap
This is more nuanced. We use explicit I/O for all labs except where we deliberately study mmap.
| Aspect | mmap | pread/pwrite |
|---|---|---|
| Code complexity | Lower (pointer access) | Higher (explicit calls) |
| Latency predictability | Bad — page faults are synchronous, can stall on cold pages | Good — every cost is visible in the syscall |
| Write durability | Tricky — msync(MS_SYNC) is expensive and synchronizes the whole mapping | Surgical — fdatasync(fd) |
| Memory accounting | Counts as anonymous memory; hard to reason about WSS | Buffers are yours, you bound them |
| Large files (> RAM) | Catastrophic — random page-in storms | Fine — you read what you need |
| Multi-threaded scaling | Page-fault locks scale poorly | Linear scaling with cores |
| TLB pressure | Hugepages help but transparent hugepage transitions can pause processes | None |
| Used by | LMDB, BoltDB | SQLite, LevelDB, RocksDB, Postgres |
The Pavlo et al. CIDR 2022 paper (linked in references.md) is the definitive teardown. TL;DR: mmap is fine when (a) the dataset fits in RAM, (b) the workload is read-heavy, and (c) you don't care about latency tails. For everything else, pread/pwrite wins.
Decision 3: Page Size = 4 KiB
We pick 4 KiB as the default page size in this lab and reconsider in later labs.
| Page size | Pros | Cons |
|---|---|---|
| 512 B | Old HDD sector; small writes are cheap | Tiny metadata-to-data ratio, lots of indirection |
| 4 KiB | Matches OS page, NVMe LBA, page cache. Sweet spot for OLTP. | Small for analytics |
| 8 KiB | Postgres default. Better for slightly larger rows. | Wastes I/O for tiny tuples |
| 16 KiB | MySQL InnoDB default. Good index fanout. | One row update = 16 KiB write |
| 64 KiB / 1 MiB | Analytics, sequential scans, Parquet row groups | Terrible for random updates |
Rule of thumb: page size ≈ the device sector size × small constant. With NVMe at 4 KiB LBA and the OS page also at 4 KiB, going smaller is fighting the hardware and going larger is amortizing a smaller win.
Decision 4: Endianness on Disk
Our on-disk format is little-endian. Justified by:
- x86 and ARM (in normal mode) are little-endian. Big-endian on these platforms means a byte swap on every read.
- Network protocols use big-endian by convention, but our on-disk format is not a network protocol — it's only read by the same machine (or by an explicit migration tool).
- LevelDB and RocksDB use little-endian for fixed-width fields, with varints for variable-width. We follow that convention for compatibility of mental model.
- SQLite uses big-endian for historical reasons (the format dates to 2000, when MIPS/PowerPC/SPARC were still common). It's a legitimate alternative; we just optimize for the modern hardware reality.
Always use explicit conversion functions at the I/O boundary. Never memcpy an int to disk and hope. Our Rust code uses u64::to_le_bytes; Go uses encoding/binary.LittleEndian; C++20 uses std::endian + std::byteswap.
Decision 5: When to call fsync
The cost of fsync on consumer NVMe is roughly:
- Single-threaded latency: ~50 µs–1 ms depending on outstanding writes
- Throughput-limited: roughly 5,000–20,000
fsync/sec before contention dominates
The cost on a HDD is 5–15 ms per fsync. That's why ye olde databases did group commit.
The right policy depends on durability requirements:
| Policy | What survives a crash | Throughput cost |
|---|---|---|
No fsync | Nothing reliably (kernel may flush eventually) | None |
fsync per write | Every acknowledged write | Massive — one syscall per write |
fsync per N writes | Last (N-1) writes possibly lost | 1/N the cost |
| Group commit | Every acknowledged write; latency = time-to-batch + fsync | Excellent — best of both |
fsync periodically (e.g., 100 ms) | Last 100 ms of writes possibly lost (MySQL innodb_flush_log_at_trx_commit=2) | Excellent |
The right design for Phase 2's WAL is group commit: when a writer finishes, it waits on a condition variable; the WAL writer thread pwrites pending records, fdatasyncs once, then wakes every waiter. We'll build this in db-03.
macOS Caveat
On macOS, fsync(fd) does not flush the drive's write cache — it only sends the data to the drive. To get true durability you must call fcntl(fd, F_FULLFSYNC), which can be 10–100× slower than fsync on the same hardware. SQLite, LevelDB, and Postgres all handle this. Our wrapper in src/*/fsync_full.* does the platform dispatch.
Decision 6: O_DIRECT — Not in This Lab
We don't use O_DIRECT in Lab 01 because:
- It requires aligned buffers (typically 4 KiB), aligned offsets, and aligned I/O sizes.
- It bypasses the page cache, so you must implement your own — which is a buffer manager (Phase 3,
db-11). - It's not available on macOS — you'd use
fcntl(fd, F_NOCACHE, 1)as the closest analogue, but it has weaker semantics.
We revisit O_DIRECT in db-21-storage-engine-advanced once we have a buffer manager worth talking about.
Hardware Numbers Cheat-Sheet
Memorize these. They drive every storage design decision:
L1 cache hit 1 ns
Branch mispredict 3 ns
L2 cache hit ~4 ns
L3 cache hit ~15 ns
DRAM access ~100 ns — 100× L1
Context switch ~1–5 µs
NVMe random read ~50 µs — 500× DRAM
NVMe sequential read ~5 µs/4KB
SATA SSD random read ~100 µs
SATA SSD seq read ~10 µs/4KB
HDD random read ~5 ms — 100,000× DRAM
HDD sequential read ~30 µs/4KB (~150 MB/s)
fsync on NVMe ~50 µs–1 ms
fsync on HDD ~10 ms
F_FULLFSYNC (macOS) ~10–50 ms — actually flushes drive cache
Network RTT same DC ~500 µs
Network RTT same region ~1 ms
Network RTT cross-region ~50–150 ms — drives Raft heartbeat tuning in db-17
Five-order-of-magnitude gaps are where the design changes. Between L1 and DRAM (100×), you can ignore it. Between DRAM and disk (1000×), you can't. Between disk and network cross-region (1000× again), distributed systems get hard.
What Breaks at Scale
- Filesystem journal contention:
fsyncon ext4 serializes through the FS journal. Many concurrentfsyncs on the same FS don't scale linearly. Mitigation: one WAL file per database, dedicated FS for WAL. - Page cache thrashing: when working set > RAM, every
preadis a miss. The kernel's LRU is generic; an app-aware cache (Phase 2's block cache, Phase 3's pager) does better. fsyncfailure handling: on Linux, a failedfsynccan mark dirty pages as clean — silently losing your data. This is the "fsyncgate" referenced in the references. Mitigation: panic onfsyncerror and crash-recover from the WAL (modern Postgres does this).- NVMe queue depth: NVMe shines with QD=32–128 in flight. A single-threaded
preadloop runs at QD=1 and leaves most of the drive idle.io_uring(Phase 5) fixes this.