Storage Primitives
The lab that earns you the right to talk about databases.
1. What Is It
This lab teaches the physical layer that every storage engine sits on top of: how data moves between a process's memory and a block device. You will learn the OS page model, the byte-order question (endianness), the three main file I/O styles (read/write, pread/pwrite, mmap, O_DIRECT), buffer alignment, and the durability primitive fsync. You'll also internalize the latency numbers for HDD, SATA SSD, and NVMe SSD that drive every storage design decision in the rest of the curriculum. The deliverable is a tiny page allocator plus a hexdump utility, written three times — once in Rust, once in Go, and once in C++ — exercising pread/pwrite against a real disk file.
2. Why It Matters
- Every later lab depends on these primitives. LSM-trees, B-trees, WALs — they're all built on
pread/pwrite/fsyncand an understanding of the page cache. - Choosing the right I/O style changes throughput by 10–100×. A naïve
readloop is not the same aspreadfrom many threads, which is not the same asio_uring, which is not the same asmmap. - Hardware shapes the algorithm. LSM-trees exist because random writes on HDDs were catastrophic. NVMe IOPS now make some classic assumptions wrong. Knowing the numbers prevents cargo-culting designs from the wrong decade.
fsyncis the single most expensive syscall in any database. Understanding when it must be called — and when you can amortize it — is the difference between 100 commits/sec and 100,000 commits/sec.
3. How It Works
User process
┌───────────────────────────────────────────────────┐
│ Your code: page_allocator, db.put("key", val) │
│ buffer = [u8; PAGE_SIZE] │
└────────────┬───────────────────┬──────────────────┘
│ │
│ pread/pwrite │ mmap
│ (explicit copy) │ (page-fault driven)
▼ ▼
┌─────────────────────────────────────┐
│ Kernel page cache (RAM) │ ← cached pages,
│ 4 KiB pages, indexed by inode+off │ LRU-evicted
└────────────────┬────────────────────┘
│ block layer
│ (scheduler, mq-deadline / none for NVMe)
▼
┌─────────────────────┐
│ Device driver │ fsync() blocks here
│ (NVMe / SATA AHCI) │ until disk acks
└─────────┬───────────┘
▼
┌─────────────────────┐
│ Storage hardware │ HDD: ~5 ms random
│ HDD / SSD / NVMe │ SSD: ~100 µs random
│ │ NVMe: ~50 µs random
└─────────────────────┘
Three things to internalize from this picture:
- Without
O_DIRECT, you always go through the kernel page cache. Yourpreadmay hit a warm cache (memcpy speed) or cold cache (full disk I/O). Latency variance is enormous. fsyncis the only way to tell the device to flush its own write cache. Without it, "the write returned" means "the kernel accepted it", not "it survives a power loss".mmapandpreadare fundamentally different mental models.mmapmakes I/O implicit (page faults),preadmakes it explicit (syscalls). LMDB chosemmap. SQLite, LevelDB, and PostgreSQL chosepread/pwrite. We will usepread/pwritefor predictability, and discussmmapin the analysis.
4. Core Terminology
| Term | Definition |
|---|---|
| Page | Fixed-size unit of I/O between user and storage. The kernel uses 4 KiB; databases pick 4–32 KiB. We use 4 KiB. |
| Page cache | Kernel-managed RAM that mirrors recently accessed file pages. Transparent to read/write and pread/pwrite. |
pread(fd, buf, n, off) | Read n bytes from fd starting at byte offset off. Does not affect the file pointer. Thread-safe. |
pwrite(fd, buf, n, off) | Write n bytes to fd at byte offset off. Thread-safe. |
mmap | Map a file region into the process's address space. Accesses become loads/stores; faults trigger page-ins. |
fsync(fd) | Block until all dirty data and metadata for fd are on stable storage. The durability primitive. |
fdatasync(fd) | Like fsync but may skip metadata updates that aren't required to retrieve the data. |
O_DIRECT | Open flag (Linux) that bypasses the page cache. Requires 512-byte or 4-KiB alignment on buffers, offsets, sizes. |
F_FULLFSYNC | macOS-only fcntl that actually flushes the drive's cache. fsync on macOS is not enough for true durability. |
| Endianness | Byte order of multi-byte integers in memory. Little-endian = LSB first (x86, ARM default); big-endian = MSB first (network byte order). |
| Alignment | An address being a multiple of N. Matters for SIMD, DMA, O_DIRECT, and many hardware operations. |
| Sector | The atomic write unit of the device. HDDs: 512 B (legacy) or 4 KiB (Advanced Format). NVMe: usually 4 KiB. |
| IOPS | I/O operations per second. The right unit for random workloads (HDD ~150, SATA SSD ~80K, NVMe ~500K–1M). |
| Latency | Time for one operation to complete. Often what users actually feel; throughput hides tail behavior. |
5. Mental Models
The page cache is a transparent cache, not a database. Think of pread like Map::get: if the key is in the cache, it's a memcpy; if it's not, the kernel goes to disk for you. You can't observe a cache miss with timing alone in production — that's the whole point of caches and the whole reason benchmarks lie.
fsync is a phone call to the disk. All other writes are "I told the postman" — fast, no guarantee. fsync is "I waited on the line while the disk confirmed the package arrived." Phone calls are slow. Group commits = bundling 100 packages into one call.
mmap is "make the file look like an array". pread is "I will ask for bytes one request at a time". The first is convenient. The second is predictable. Convenience and predictability are usually at war.
Sequential vs random I/O on an HDD is 100× different. On NVMe it's 2–3×. This is why LSM-trees won the 2000s and why "just append" got rediscovered in the 2010s and why NVMe makes some of those assumptions less critical in the 2020s. Hardware shapes design.
6. Common Misconceptions
- "
writereturning means my data is safe." False. The kernel buffered it. Onlyfsync(orfdatasyncfor data-only, orF_FULLFSYNCon macOS) guarantees durability. - "
mmapis faster thanpreadbecause there's no syscall." Often false.mmapaccess generates page faults, which are also context switches into the kernel, plus they're synchronous (you can't overlap them with computation as easily). LMDB-style designs win when the working set fits in RAM; they suffer on writes due to fsyncing the mapping. - "SSDs make random vs sequential irrelevant." Partially true. Random reads are fast, but random writes still incur garbage collection and write amplification at the firmware level. Sequential writes still reduce hardware WA significantly.
- "4 KiB is always the right page size." No. It matches OS page size, which is friendly for
mmapand for the page cache. But LevelDB uses 4 KiB blocks (read amp) and 64 MiB SSTables (sequential writes). PostgreSQL uses 8 KiB pages. The "right" page size depends on workload. - "
fsyncflushes only my file." On many filesystems and many older kernels,fsynccould flush more (or less) than expected. Modern ext4/xfs are sane, but historical PostgreSQLfsyncbugs (2018) showed that the contract is more subtle than it looks.
7. Interview Talking Points
- "For a write-heavy OLTP workload on local NVMe, I'd start with direct
pwrite+fdatasyncrather thanmmap.mmapmakes durable writes ambiguous —msync(MS_SYNC)is a heavier hammer thanfdatasyncbecause it covers the whole mapping, and you give up control over write ordering." - "My rule of thumb: HDD random read ≈ 5 ms, SATA SSD ≈ 100 µs, NVMe ≈ 50 µs, RAM ≈ 100 ns, L1 ≈ 1 ns. Every five orders of magnitude is where a different design becomes interesting. LSM-trees collapse the gap between random and sequential by converting random writes to sequential ones."
- "
fsyncis what amortizes the difference between 100 commits/sec and 100,000 commits/sec. Group commit batches N concurrent transactions into onefsync, trading latency (one transaction may wait ~5 ms for a batch) for throughput (100× more committed transactions perfsync). Postgres, MySQL InnoDB, and SQLite all do this." - "
O_DIRECTisn't a free win. You skip the page cache, so you have to implement your own cache and your buffers must be aligned. PostgreSQL deliberately uses the page cache and lets the OS do that work for it. Oracle and Sybase useO_DIRECT. The choice depends on whether you trust your buffer manager more than the kernel's."
8. Connections to Other Labs
db-02— uses the page-aligned allocator from here for skip-list and hash-table node storage.db-03— the WAL is literallypwrite+fdatasyncin a loop; this lab gives you the muscle memory.db-06— SSTable blocks are read viapreadat known offsets; this lab is the read side.db-11— the SQLite pager is apread/pwrite-based page cache; you'll reimplement what the kernel does for you here.db-21— revisits I/O withio_uring(Linux) andO_DIRECTfor the advanced cases; this lab establishes the baseline.