Broader Ideas — Storage Primitives

Where to go after this lab if you want to push deeper. Each idea is a self-contained extension or alternative.

1. Replace pread with io_uring (Linux)

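The single biggest jump from this lab's design to a modern engine is moving from synchronous syscalls to asynchronous submission queues. With pread at queue depth 1 (QD=1), an NVMe drive delivers only a small fraction of its rated IOPS (roughly 5%), because each request waits for the previous one to complete. With io_uring at QD=32 or higher, it can approach the spec-sheet figure.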

  • Lab pointer: db-21-storage-engine-advanced does this end-to-end.
  • Self-study: implement a pread_async API now that internally still uses pread but queues requests through a crossbeam channel (Rust) / goroutine pool (Go) / std::jthread worker pool (C++). When you then swap the backend for io_uring, no API consumer changes.
  • Reference: Jens Axboe's "Efficient IO with io_uring" (https://kernel.dk/io_uring.pdf), §3.
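The self-study idea above can be sketched roughly as follows. This is a minimal illustration, not the lab's implementation: AsyncReader and ReadReq are hypothetical names, it uses std::sync::mpsc in place of a crossbeam channel so that it needs no external crate, and it runs a single worker thread rather than a pool.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt; // read_exact_at is built on pread(2) on Unix
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

/// One queued read: where to read, how much, and where to send the result.
struct ReadReq {
    offset: u64,
    len: usize,
    reply: Sender<io::Result<Vec<u8>>>,
}

/// Handle held by API consumers (hypothetical name).
struct AsyncReader {
    queue: Sender<ReadReq>,
}

impl AsyncReader {
    /// Spawns one worker thread that services requests with blocking pread.
    fn new(file: File) -> Self {
        let (queue, rx) = channel::<ReadReq>();
        thread::spawn(move || {
            // The loop ends when every AsyncReader handle has been dropped.
            for req in rx {
                let mut buf = vec![0u8; req.len];
                let res = file.read_exact_at(&mut buf, req.offset).map(|_| buf);
                let _ = req.reply.send(res); // the caller may have given up; ignore
            }
        });
        AsyncReader { queue }
    }

    /// Submits a read and returns immediately; the caller waits on the receiver.
    fn pread_async(&self, offset: u64, len: usize) -> Receiver<io::Result<Vec<u8>>> {
        let (reply, done) = channel();
        self.queue
            .send(ReadReq { offset, len, reply })
            .expect("worker thread exited");
        done
    }
}
```

A caller submits several reads with pread_async and then calls recv() on each returned receiver. Because consumers only see the submit-then-wait shape, the worker loop could later be replaced by an io_uring submission queue without touching them.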

2. Page Layout — Slotted Pages vs Fixed-Size Records

Our pages are zero-padded ASCII. Real engines use slotted pages:

┌────────┬────────────────────────────┬──────┐
│ header │ slot[0] slot[1] ...        │ free │
│        │ → offsets into page        │      │
├────────┴────────────────────────────┴──────┤
│ ← record N ← record 1 ← record 0           │  (grows from end)
└────────────────────────────────────────────┘

This lets variable-length records share a page without external fragmentation. PostgreSQL, MySQL InnoDB, and SQLite all use slotted pages. Try this: extend pagealloc so each page holds a slot directory and stores up to 16 variable-length keys per page. This is the warm-up for db-10.
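One possible in-memory shape for such a page is sketched below. The layout is an assumption made for illustration (a 4-byte header holding slot_count and free_end, and 4-byte slots holding offset and length, all little-endian u16); it is not the format of any engine named above, and it does not enforce the 16-key limit from the exercise.

```rust
const PAGE_SIZE: usize = 4096;
const HEADER_SIZE: usize = 4; // slot_count: u16, free_end: u16
const SLOT_SIZE: usize = 4; // offset: u16, len: u16

struct SlottedPage {
    buf: [u8; PAGE_SIZE],
}

impl SlottedPage {
    fn new() -> Self {
        let mut p = SlottedPage { buf: [0u8; PAGE_SIZE] };
        p.set_u16(0, 0); // slot_count
        p.set_u16(2, PAGE_SIZE as u16); // free_end: records grow down from here
        p
    }

    fn get_u16(&self, at: usize) -> u16 {
        u16::from_le_bytes([self.buf[at], self.buf[at + 1]])
    }

    fn set_u16(&mut self, at: usize, v: u16) {
        self.buf[at..at + 2].copy_from_slice(&v.to_le_bytes());
    }

    fn slot_count(&self) -> usize {
        self.get_u16(0) as usize
    }

    /// Appends a record; returns its slot index, or None if it does not fit.
    fn insert(&mut self, record: &[u8]) -> Option<usize> {
        let n = self.slot_count();
        let free_end = self.get_u16(2) as usize;
        // End of the slot directory once one more slot has been added.
        let slots_end = HEADER_SIZE + (n + 1) * SLOT_SIZE;
        if slots_end + record.len() > free_end {
            return None;
        }
        let start = free_end - record.len();
        self.buf[start..free_end].copy_from_slice(record);
        let slot_at = HEADER_SIZE + n * SLOT_SIZE;
        self.set_u16(slot_at, start as u16);
        self.set_u16(slot_at + 2, record.len() as u16);
        self.set_u16(0, (n + 1) as u16);
        self.set_u16(2, start as u16);
        Some(n)
    }

    /// Returns the record bytes for a slot, if the slot exists.
    fn get(&self, slot: usize) -> Option<&[u8]> {
        if slot >= self.slot_count() {
            return None;
        }
        let slot_at = HEADER_SIZE + slot * SLOT_SIZE;
        let start = self.get_u16(slot_at) as usize;
        let len = self.get_u16(slot_at + 2) as usize;
        Some(&self.buf[start..start + len])
    }
}
```

The slot directory grows forward from the header while record bytes grow backward from the end of the page, so the page is full exactly when the two regions would meet.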

3. Copy-on-Write Pages (LMDB-style)

Instead of overwriting a page in place, allocate a fresh page and update the parent to point to it; the parent is itself copied, and so on up to the root. This is how LMDB achieves single-writer MVCC without a WAL. Pros: simpler crash recovery (just point at the last committed root). Cons: unreferenced pages must be reclaimed, and write amplification rises, since changing one leaf rewrites every page on the path to the root.

  • Reference: Howard Chu's LMDB tech docs, http://www.lmdb.tech/doc/
  • Self-study: extend the allocator to track free pages and never overwrite; introduce a "commit" op that just writes a new root pointer atomically.
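The idea can be illustrated with a toy in-memory model. Everything here is hypothetical and simplified: CowStore, Page, and the two-level tree (one root over a flat list of leaves) are invented for the sketch, and a Vec stands in for the page file. It is not LMDB's structure.

```rust
/// A page is either a leaf holding bytes or a root holding child page ids.
#[derive(Clone, Debug, PartialEq)]
enum Page {
    Leaf(Vec<u8>),
    Root(Vec<usize>),
}

struct CowStore {
    pages: Vec<Page>,      // append-only: a written page is never modified
    committed_root: usize, // stands in for the on-disk root pointer
}

impl CowStore {
    fn new(leaves: Vec<Vec<u8>>) -> Self {
        let mut pages: Vec<Page> = leaves.into_iter().map(Page::Leaf).collect();
        let root = Page::Root((0..pages.len()).collect());
        pages.push(root);
        let committed_root = pages.len() - 1;
        CowStore { pages, committed_root }
    }

    fn alloc(&mut self, page: Page) -> usize {
        self.pages.push(page);
        self.pages.len() - 1
    }

    /// Writes a new version of leaf `slot`; returns an uncommitted root id.
    fn update(&mut self, base_root: usize, slot: usize, data: Vec<u8>) -> usize {
        let mut children = match &self.pages[base_root] {
            Page::Root(c) => c.clone(),
            Page::Leaf(_) => panic!("not a root page"),
        };
        children[slot] = self.alloc(Page::Leaf(data)); // fresh leaf, old one untouched
        self.alloc(Page::Root(children)) // fresh root pointing at the new leaf
    }

    /// Commit: a single pointer update makes the new tree visible.
    fn commit(&mut self, new_root: usize) {
        self.committed_root = new_root;
    }

    fn read(&self, root: usize, slot: usize) -> &[u8] {
        match &self.pages[root] {
            Page::Root(c) => match &self.pages[c[slot]] {
                Page::Leaf(d) => d.as_slice(),
                Page::Root(_) => panic!("expected a leaf"),
            },
            Page::Leaf(_) => panic!("not a root page"),
        }
    }
}
```

A reader that captured the old root id keeps seeing the old leaf after the commit, which is the snapshot behavior described above. The old leaf and old root are also exactly the pages a free-page tracker would eventually have to reclaim.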

4. Write Coalescing & Group Commit

Right now every write calls fsync immediately. As soon as several writers are in flight, group commit pays off: queue the writes, issue them together, and let one fdatasync cover the whole batch:

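// Sketch only; WriteReq and its fields are illustrative, not part of pagealloc.
use std::{fs::File, io, os::unix::fs::FileExt, time::{Duration, Instant}};
use std::sync::mpsc::{Receiver, RecvTimeoutError, Sender};

// One queued write; `done` tells the submitter that its data is durable.
struct WriteReq { offset: u64, data: Vec<u8>, done: Sender<()> }

fn group_commit(file: &File, rx: Receiver<WriteReq>) -> io::Result<()> {
    let window = Duration::from_micros(100);
    let (mut pending, mut last_sync) = (Vec::new(), Instant::now());
    loop {
        // Wake at least once per window so a lone request is not stranded.
        let closed = match rx.recv_timeout(window) {
            Ok(req) => { pending.push(req); false }
            Err(RecvTimeoutError::Timeout) => false,
            Err(RecvTimeoutError::Disconnected) => true,
        };
        if closed || last_sync.elapsed() > window || pending.len() > 64 {
            for req in &pending { file.write_all_at(&req.data, req.offset)?; }
            if !pending.is_empty() { file.sync_data()?; } // one sync (fdatasync on Linux) per batch
            for req in pending.drain(..) { let _ = req.done.send(()); }
            last_sync = Instant::now();
        }
        if closed { return Ok(()); }
    }
}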
  • Lab pointer: db-03-write-ahead-log builds this for the WAL. Try it here as warm-up.
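  • Trade-off: each write's latency can increase by up to the 100 µs batching window; in exchange, throughput under contention can rise sharply (on the order of ~50×, depending on the device's fsync latency and the number of writers).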

5. Direct I/O + Aligned Buffers

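O_DIRECT (Linux, passed to open) or F_NOCACHE (macOS, set with fcntl) bypasses the page cache. With O_DIRECT, the buffer address, file offset, and transfer length must all be aligned, typically to the device's logical block size; 4 KiB alignment is a safe choice (in Rust: Layout::from_size_align(4096, 4096)?; in C++: posix_memalign(&buf, 4096, 4096); in Go it is trickier: use golang.org/x/sys/unix.Mmap with MAP_ANON, which returns page-aligned memory).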

  • When this matters: when your app has a better cache than the kernel (e.g., Phase 2's block cache). Oracle, MySQL (InnoDB with innodb_flush_method=O_DIRECT), and most flash-tuned engines pick this.
  • Self-study: add a pagealloc write-direct subcommand that opens with O_DIRECT and demonstrates the alignment requirement (the program must fail predictably, typically with EINVAL on Linux, if the buffer is unaligned).
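For the Rust side of the alignment requirement, a small owning wrapper around the standard allocator is one way to get a 4 KiB-aligned buffer. This is a sketch under assumptions: AlignedBuf is a hypothetical name, and the block only allocates the buffer; it does not open a file with O_DIRECT, since that flag's value comes from the libc crate rather than from std.

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};
use std::ops::{Deref, DerefMut};
use std::ptr::NonNull;
use std::slice;

/// Heap buffer whose start address is a multiple of `align` (hypothetical type).
struct AlignedBuf {
    ptr: NonNull<u8>,
    layout: Layout,
}

impl AlignedBuf {
    fn new(size: usize, align: usize) -> Self {
        assert!(size > 0, "zero-sized buffers are not supported in this sketch");
        let layout = Layout::from_size_align(size, align).expect("invalid size/alignment");
        // SAFETY: the layout has non-zero size, checked above.
        let raw = unsafe { alloc_zeroed(layout) };
        let ptr = NonNull::new(raw).expect("allocation failed");
        AlignedBuf { ptr, layout }
    }
}

impl Deref for AlignedBuf {
    type Target = [u8];
    fn deref(&self) -> &[u8] {
        // SAFETY: ptr is valid for layout.size() zero-initialized bytes.
        unsafe { slice::from_raw_parts(self.ptr.as_ptr(), self.layout.size()) }
    }
}

impl DerefMut for AlignedBuf {
    fn deref_mut(&mut self) -> &mut [u8] {
        // SAFETY: as above, and &mut self guarantees exclusive access.
        unsafe { slice::from_raw_parts_mut(self.ptr.as_ptr(), self.layout.size()) }
    }
}

impl Drop for AlignedBuf {
    fn drop(&mut self) {
        // SAFETY: ptr was allocated with exactly this layout.
        unsafe { dealloc(self.ptr.as_ptr(), self.layout) }
    }
}
```

Because the wrapper dereferences to a byte slice, it can be passed directly to read_at or write_at from std::os::unix::fs::FileExt. Opening the file with O_DIRECT would go through OpenOptionsExt::custom_flags.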

6. Sparse Files & Hole Punching

Files don't have to be fully allocated on disk. On Linux, fallocate with FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE releases blocks back to the filesystem without changing the file size; the punched range reads back as zeros. Useful for LSM-tree SSTable compaction (freeing space after removing dead keys) and for journal truncation.

  • Reference: man 2 fallocate
  • Self-study: add pagealloc punch <file> <page_no> and verify that the apparent size (ls -l <file>, or du -h --apparent-size <file> with GNU du) is unchanged while the on-disk size (du -h <file>) shrinks.
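The two sizes can also be read programmatically. The sketch below uses only the standard library, so it does not punch a hole (fallocate is not exposed by std); instead it creates a sparse file with set_len and reports apparent size next to allocated bytes. The names sizes and make_sparse are hypothetical helpers, and the allocated figure depends on the filesystem.

```rust
use std::fs::{File, Metadata};
use std::io;
use std::os::unix::fs::{FileExt, MetadataExt};
use std::path::Path;

/// Returns (apparent size, allocated bytes). st_blocks counts 512-byte units.
fn sizes(meta: &Metadata) -> (u64, u64) {
    (meta.len(), meta.blocks() * 512)
}

/// Creates a 1 MiB file whose only written data is its last 4 KiB page.
fn make_sparse(path: &Path) -> io::Result<(u64, u64)> {
    let file = File::create(path)?;
    file.set_len(1 << 20)?; // extends the file without writing data
    file.write_all_at(&[0xAA; 4096], (1 << 20) - 4096)?; // only this page needs blocks
    file.sync_all()?;
    Ok(sizes(&file.metadata()?))
}
```

On a filesystem that supports sparse files, the second number should come out far smaller than the first. Calling sizes before and after a punch subcommand would show the same effect as comparing ls -l with du.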

7. Crash Testing with dm-flakey (Linux)

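The hard part of storage code is testing the failure cases. dm-flakey is a Linux device-mapper target that simulates an unreliable device on a fixed schedule: it alternates between an "up" interval of normal operation and a "down" interval in which I/O fails or, with drop_writes, writes are silently discarded.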

# Device size in 512-byte sectors.
size=$(sudo blockdev --getsz /dev/loop0)
# Table: 5 seconds of normal operation, then 1 second of dropping writes, repeating.
sudo dmsetup create flakey-dev --table "0 $size flakey /dev/loop0 0 5 1 1 drop_writes"

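Mount your test filesystem on /dev/mapper/flakey-dev and run your pagealloc write loop across the drop window, then unmount and remount so that you read from the device rather than from the page cache. Writes that were never fsynced may be lost. Writes whose fsync returned before the drop window began should survive; writes issued during the window are discarded even though the calls report success, so the window behaves much like a power cut. Block-layer fault injection of this kind is a common way to test durability.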
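A write loop that is easy to check afterwards might look like the sketch below. It is an illustration under assumptions: write_loop and count_valid_pages are hypothetical helpers, each 4 KiB page is stamped with its page number plus one (so an all-zero page reads as "never written"), and fsync is called after every page.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;

const PAGE: usize = 4096;

/// Writes pages 0..count, each stamped with (page number + 1), and calls
/// fsync after every page. Prints each acknowledgement as it happens.
fn write_loop(file: &File, count: u64) -> io::Result<u64> {
    let mut page = [0u8; PAGE];
    for n in 0..count {
        page[..8].copy_from_slice(&(n + 1).to_le_bytes());
        file.write_all_at(&page, n * PAGE as u64)?;
        file.sync_all()?;
        println!("acked {}", n + 1); // record this somewhere off the test device
    }
    Ok(count)
}

/// Counts the leading pages whose stamp matches their position.
fn count_valid_pages(file: &File) -> io::Result<u64> {
    let total = file.metadata()?.len() / PAGE as u64;
    let mut stamp = [0u8; 8];
    for n in 0..total {
        file.read_exact_at(&mut stamp, n * PAGE as u64)?;
        if u64::from_le_bytes(stamp) != n + 1 {
            return Ok(n);
        }
    }
    Ok(total)
}
```

After remounting, compare the output of count_valid_pages with the acknowledgements logged before the drop window began. Any page acknowledged before the window but missing afterwards points at a durability bug.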

8. Comparing mmap Yourself

We argued for pread/pwrite. Don't take our word for it — implement pagealloc-mmap as a fourth implementation. Compare:

Workload                                            pread  mmap
--------------------------------------------------  -----  ----
Sequential read of 1 GB                             ?      ?
Random read of 4 KiB × 10⁶ from a 1 GB file (warm)  ?      ?
Random read from a 100 GB file (cold)               ?      ?
10⁵ random writes with durability                   ?      ?

Plot the results and write down what surprised you. Then compare your numbers with the mmap paper by Crotty, Leis, and Pavlo (in references.md) and check whether they agree.
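A starting point for the pread column of the random-read rows could look like this. It is a rough sketch, not a rigorous benchmark: bench_random_pread and XorShift are hypothetical names, the generator is a simple xorshift chosen to avoid an external crate, and the mmap side is left out because std has no mmap API.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;
use std::time::{Duration, Instant};

const PAGE: u64 = 4096;

/// Tiny xorshift64 generator so the sketch needs no external crate.
struct XorShift(u64);

impl XorShift {
    fn next(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

/// Reads `ops` random page-aligned 4 KiB pages with pread. Returns the elapsed
/// time and a checksum, so the reads have an observable result.
fn bench_random_pread(file: &File, ops: u64, seed: u64) -> io::Result<(Duration, u64)> {
    let pages = file.metadata()?.len() / PAGE;
    if pages == 0 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "file is smaller than one page",
        ));
    }
    let mut rng = XorShift(seed | 1); // xorshift state must be non-zero
    let mut buf = [0u8; PAGE as usize];
    let mut checksum = 0u64;
    let start = Instant::now();
    for _ in 0..ops {
        let page_no = rng.next() % pages;
        file.read_exact_at(&mut buf, page_no * PAGE)?;
        checksum = checksum.wrapping_add(buf[0] as u64);
    }
    Ok((start.elapsed(), checksum))
}
```

Dividing ops by the elapsed seconds gives reads per second. For the cold rows, the page cache has to be dropped between runs, otherwise the numbers describe memory rather than the device.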

9. Persistent Memory (PMEM, Optane)

Intel Optane is dead, but persistent-memory programming patterns survive in CXL.mem and in research kernels. PMEM is byte-addressable like RAM and persistent like an SSD, with clwb + sfence as the durability primitive (no fsync). The Persistent Memory Development Kit (PMDK) is what to read.

  • Reference: https://pmem.io/pmdk/
  • Why it matters: if/when CXL persistent memory becomes commodity, every storage engine in this curriculum will need a rewrite. Research designs such as SplitFS and uTree already assume PMEM.

10. Beyond Disk: Object Storage as a Backing Store

Modern cloud-native databases (Snowflake, Databricks, BigQuery) don't pwrite to local disks; they PUT objects of a few MiB (4 MiB, say) to an object store such as S3. The trade-offs are wildly different: high per-request latency, aggregate throughput that scales almost without limit through parallelism, and, in S3's case, only eventual consistency for overwrites until December 2020. The closest "primitives lab" for that world would replace pread/pwrite with HTTP range requests. Worth thinking about, especially before db-23's capstone.

  • Reference: "Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics" (Armbrust et al., CIDR 2021)