Storage Primitives

The lab that earns you the right to talk about databases.

1. What Is It

This lab teaches the physical layer that every storage engine sits on top of: how data moves between a process's memory and a block device. You will learn the OS page model, the byte-order question (endianness), the three main file I/O styles (read/write, pread/pwrite, mmap, O_DIRECT), buffer alignment, and the durability primitive fsync. You'll also internalize the latency numbers for HDD, SATA SSD, and NVMe SSD that drive every storage design decision in the rest of the curriculum. The deliverable is a tiny page allocator plus a hexdump utility, written three times — once in Rust, once in Go, and once in C++ — exercising pread/pwrite against a real disk file.

2. Why It Matters

  • Every later lab depends on these primitives. LSM-trees, B-trees, WALs — they're all built on pread/pwrite/fsync and an understanding of the page cache.
  • Choosing the right I/O style changes throughput by 10–100×. A naïve read loop is not the same as pread from many threads, which is not the same as io_uring, which is not the same as mmap.
  • Hardware shapes the algorithm. LSM-trees exist because random writes on HDDs were catastrophic. NVMe IOPS now make some classic assumptions wrong. Knowing the numbers prevents cargo-culting designs from the wrong decade.
  • fsync is the single most expensive syscall in any database. Understanding when it must be called — and when you can amortize it — is the difference between 100 commits/sec and 100,000 commits/sec.

3. How It Works

                  User process
   ┌───────────────────────────────────────────────────┐
   │   Your code: page_allocator, db.put("key", val)   │
   │           buffer = [u8; PAGE_SIZE]                │
   └────────────┬───────────────────┬──────────────────┘
                │                   │
                │  pread/pwrite     │  mmap
                │  (explicit copy)  │  (page-fault driven)
                ▼                   ▼
        ┌─────────────────────────────────────┐
        │       Kernel page cache (RAM)       │  ← cached pages,
        │   4 KiB pages, indexed by inode+off │     LRU-evicted
        └────────────────┬────────────────────┘
                         │  block layer
                         │  (scheduler, mq-deadline / none for NVMe)
                         ▼
                ┌─────────────────────┐
                │  Device driver      │  fsync() blocks here
                │  (NVMe / SATA AHCI) │  until disk acks
                └─────────┬───────────┘
                          ▼
                ┌─────────────────────┐
                │  Storage hardware   │  HDD:  ~5 ms  random
                │  HDD / SSD / NVMe   │  SSD:  ~100 µs random
                │                     │  NVMe: ~50  µs random
                └─────────────────────┘

Three things to internalize from this picture:

  1. Without O_DIRECT, you always go through the kernel page cache. Your pread may hit a warm cache (memcpy speed) or cold cache (full disk I/O). Latency variance is enormous.
  2. fsync is the only way to tell the device to flush its own write cache. Without it, "the write returned" means "the kernel accepted it", not "it survives a power loss".
  3. mmap and pread are fundamentally different mental models. mmap makes I/O implicit (page faults), pread makes it explicit (syscalls). LMDB chose mmap. SQLite, LevelDB, and PostgreSQL chose pread/pwrite. We will use pread/pwrite for predictability, and discuss mmap in the analysis.

4. Core Terminology

TermDefinition
PageFixed-size unit of I/O between user and storage. The kernel uses 4 KiB; databases pick 4–32 KiB. We use 4 KiB.
Page cacheKernel-managed RAM that mirrors recently accessed file pages. Transparent to read/write and pread/pwrite.
pread(fd, buf, n, off)Read n bytes from fd starting at byte offset off. Does not affect the file pointer. Thread-safe.
pwrite(fd, buf, n, off)Write n bytes to fd at byte offset off. Thread-safe.
mmapMap a file region into the process's address space. Accesses become loads/stores; faults trigger page-ins.
fsync(fd)Block until all dirty data and metadata for fd are on stable storage. The durability primitive.
fdatasync(fd)Like fsync but may skip metadata updates that aren't required to retrieve the data.
O_DIRECTOpen flag (Linux) that bypasses the page cache. Requires 512-byte or 4-KiB alignment on buffers, offsets, sizes.
F_FULLFSYNCmacOS-only fcntl that actually flushes the drive's cache. fsync on macOS is not enough for true durability.
EndiannessByte order of multi-byte integers in memory. Little-endian = LSB first (x86, ARM default); big-endian = MSB first (network byte order).
AlignmentAn address being a multiple of N. Matters for SIMD, DMA, O_DIRECT, and many hardware operations.
SectorThe atomic write unit of the device. HDDs: 512 B (legacy) or 4 KiB (Advanced Format). NVMe: usually 4 KiB.
IOPSI/O operations per second. The right unit for random workloads (HDD ~150, SATA SSD ~80K, NVMe ~500K–1M).
LatencyTime for one operation to complete. Often what users actually feel; throughput hides tail behavior.

5. Mental Models

The page cache is a transparent cache, not a database. Think of pread like Map::get: if the key is in the cache, it's a memcpy; if it's not, the kernel goes to disk for you. You can't observe a cache miss with timing alone in production — that's the whole point of caches and the whole reason benchmarks lie.

fsync is a phone call to the disk. All other writes are "I told the postman" — fast, no guarantee. fsync is "I waited on the line while the disk confirmed the package arrived." Phone calls are slow. Group commits = bundling 100 packages into one call.

mmap is "make the file look like an array". pread is "I will ask for bytes one request at a time". The first is convenient. The second is predictable. Convenience and predictability are usually at war.

Sequential vs random I/O on an HDD is 100× different. On NVMe it's 2–3×. This is why LSM-trees won the 2000s and why "just append" got rediscovered in the 2010s and why NVMe makes some of those assumptions less critical in the 2020s. Hardware shapes design.

6. Common Misconceptions

  1. "write returning means my data is safe." False. The kernel buffered it. Only fsync (or fdatasync for data-only, or F_FULLFSYNC on macOS) guarantees durability.
  2. "mmap is faster than pread because there's no syscall." Often false. mmap access generates page faults, which are also context switches into the kernel, plus they're synchronous (you can't overlap them with computation as easily). LMDB-style designs win when the working set fits in RAM; they suffer on writes due to fsyncing the mapping.
  3. "SSDs make random vs sequential irrelevant." Partially true. Random reads are fast, but random writes still incur garbage collection and write amplification at the firmware level. Sequential writes still reduce hardware WA significantly.
  4. "4 KiB is always the right page size." No. It matches OS page size, which is friendly for mmap and for the page cache. But LevelDB uses 4 KiB blocks (read amp) and 64 MiB SSTables (sequential writes). PostgreSQL uses 8 KiB pages. The "right" page size depends on workload.
  5. "fsync flushes only my file." On many filesystems and many older kernels, fsync could flush more (or less) than expected. Modern ext4/xfs are sane, but historical PostgreSQL fsync bugs (2018) showed that the contract is more subtle than it looks.

7. Interview Talking Points

  • "For a write-heavy OLTP workload on local NVMe, I'd start with direct pwrite + fdatasync rather than mmap. mmap makes durable writes ambiguous — msync(MS_SYNC) is a heavier hammer than fdatasync because it covers the whole mapping, and you give up control over write ordering."
  • "My rule of thumb: HDD random read ≈ 5 ms, SATA SSD ≈ 100 µs, NVMe ≈ 50 µs, RAM ≈ 100 ns, L1 ≈ 1 ns. Every five orders of magnitude is where a different design becomes interesting. LSM-trees collapse the gap between random and sequential by converting random writes to sequential ones."
  • "fsync is what amortizes the difference between 100 commits/sec and 100,000 commits/sec. Group commit batches N concurrent transactions into one fsync, trading latency (one transaction may wait ~5 ms for a batch) for throughput (100× more committed transactions per fsync). Postgres, MySQL InnoDB, and SQLite all do this."
  • "O_DIRECT isn't a free win. You skip the page cache, so you have to implement your own cache and your buffers must be aligned. PostgreSQL deliberately uses the page cache and lets the OS do that work for it. Oracle and Sybase use O_DIRECT. The choice depends on whether you trust your buffer manager more than the kernel's."

8. Connections to Other Labs

  • db-02 — uses the page-aligned allocator from here for skip-list and hash-table node storage.
  • db-03 — the WAL is literally pwrite + fdatasync in a loop; this lab gives you the muscle memory.
  • db-06 — SSTable blocks are read via pread at known offsets; this lab is the read side.
  • db-11 — the SQLite pager is a pread/pwrite-based page cache; you'll reimplement what the kernel does for you here.
  • db-21 — revisits I/O with io_uring (Linux) and O_DIRECT for the advanced cases; this lab establishes the baseline.