Broader Ideas — Beyond the Minimum
Things worth knowing that aren't in the lab code.
Block-grouped framing (LevelDB / RocksDB)
LevelDB pads records into 32 KB blocks and uses a 1-byte type field (FULL, FIRST, MIDDLE, LAST) to handle records that straddle blocks. Benefit: corruption in one block can't propagate; recovery can resync to the next block boundary. Cost: more code, slightly more space.
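To make the FULL / FIRST / MIDDLE / LAST scheme concrete, here is a minimal sketch of the fragmentation decision only. It assumes the 7-byte header described in LevelDB's log format notes (4-byte checksum, 2-byte length, 1-byte type); `plan_fragments` and `FragmentType` are hypothetical names used for illustration, not LevelDB's API, and no bytes are actually written.

```rust
/// Block size from the text; header size is an assumption taken from
/// LevelDB's log format notes (4-byte checksum + 2-byte length + 1-byte type).
const BLOCK_SIZE: usize = 32 * 1024;
const HEADER_SIZE: usize = 7;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FragmentType {
    Full,
    First,
    Middle,
    Last,
}

/// Hypothetical helper: given the current offset inside a block
/// (`block_offset <= BLOCK_SIZE`) and a payload length, return the
/// fragments that would be emitted as (type, payload bytes) pairs,
/// plus the block offset after the last fragment.
fn plan_fragments(mut block_offset: usize, len: usize) -> (Vec<(FragmentType, usize)>, usize) {
    let mut fragments = Vec::new();
    let mut left = len;
    let mut begin = true;
    loop {
        // Not enough room for a header: the rest of the block is padding
        // and the next fragment starts a fresh block.
        if BLOCK_SIZE - block_offset < HEADER_SIZE {
            block_offset = 0;
        }
        let avail = BLOCK_SIZE - block_offset - HEADER_SIZE;
        let take = left.min(avail);
        let end = take == left;
        let ty = match (begin, end) {
            (true, true) => FragmentType::Full,
            (true, false) => FragmentType::First,
            (false, true) => FragmentType::Last,
            (false, false) => FragmentType::Middle,
        };
        fragments.push((ty, take));
        block_offset += HEADER_SIZE + take;
        left -= take;
        begin = false;
        if left == 0 {
            break;
        }
    }
    (fragments, block_offset)
}
```

Because every fragment starts with its own header and never crosses a block boundary, a reader that hits a bad checksum can skip to the next multiple of BLOCK_SIZE and resume at the next FULL or FIRST fragment.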
Group commit, properly
Real systems run a "log writer" goroutine/thread:
clients ──► append to buffer ──► wake writer ──► fsync once ──► broadcast cond var
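The writer batches all records that arrived during the previous fsync into the next fsync. A record therefore waits for at most the fsync already in flight plus the fsync that carries its own batch, so latency stays bounded by roughly two fsync times; throughput scales with batch size until you saturate the SSD.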
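The pipeline above can be sketched with one mutex and two condition variables from the standard library. This is an illustrative sketch, not the lab's code: `GroupCommit`, `commit`, `close`, and `sync_count` are hypothetical names, records are counted rather than framed, and error handling is deliberately thin (if the writer hits an I/O error, waiting clients would block forever).

```rust
use std::fs::File;
use std::io::{self, Write};
use std::sync::{Arc, Condvar, Mutex};
use std::thread::{self, JoinHandle};

/// State shared between clients and the log-writer thread.
struct State {
    pending: Vec<u8>, // bytes appended since the writer last took a batch
    appended: u64,    // number of records appended so far
    durable: u64,     // records 1..=durable are covered by a completed fsync
    syncs: u64,       // number of fsyncs issued by the writer
    closed: bool,
}

pub struct GroupCommit {
    state: Mutex<State>,
    wake_writer: Condvar,
    wake_clients: Condvar,
}

impl GroupCommit {
    /// Spawn the writer thread; returns a shared handle and the thread's JoinHandle.
    pub fn start(file: File) -> (Arc<GroupCommit>, JoinHandle<io::Result<()>>) {
        let gc = Arc::new(GroupCommit {
            state: Mutex::new(State {
                pending: Vec::new(),
                appended: 0,
                durable: 0,
                syncs: 0,
                closed: false,
            }),
            wake_writer: Condvar::new(),
            wake_clients: Condvar::new(),
        });
        let writer = Arc::clone(&gc);
        let handle = thread::spawn(move || writer.run(file));
        (gc, handle)
    }

    /// Append one record and block until an fsync covering it has completed.
    pub fn commit(&self, record: &[u8]) {
        let mut st = self.state.lock().unwrap();
        st.pending.extend_from_slice(record);
        st.appended += 1;
        let my_seq = st.appended;
        self.wake_writer.notify_one();
        while st.durable < my_seq {
            st = self.wake_clients.wait(st).unwrap();
        }
    }

    /// Ask the writer to exit once the buffer is drained.
    pub fn close(&self) {
        self.state.lock().unwrap().closed = true;
        self.wake_writer.notify_one();
    }

    pub fn sync_count(&self) -> u64 {
        self.state.lock().unwrap().syncs
    }

    fn run(&self, mut file: File) -> io::Result<()> {
        loop {
            let (batch, upto) = {
                let mut st = self.state.lock().unwrap();
                while st.pending.is_empty() && !st.closed {
                    st = self.wake_writer.wait(st).unwrap();
                }
                if st.pending.is_empty() {
                    return Ok(()); // closed and nothing left to flush
                }
                let batch = std::mem::take(&mut st.pending);
                let upto = st.appended;
                (batch, upto)
            };
            // I/O runs outside the lock, so clients keep appending to the
            // next batch while this one is being written and synced.
            file.write_all(&batch)?;
            file.sync_data()?;
            let mut st = self.state.lock().unwrap();
            st.durable = upto;
            st.syncs += 1;
            self.wake_clients.notify_all();
        }
    }
}
```

A caller would do something like `let (gc, writer) = GroupCommit::start(file);`, call `gc.commit(bytes)` from many threads, then `gc.close()` and join `writer`. With concurrent clients, `sync_count()` is typically well below the number of committed records, which is the whole point of batching.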
O_DSYNC vs application-level fsync
O_DSYNC makes every write() durable before returning. Removes the need for explicit fsync, but you lose the chance to batch. Real DBs prefer explicit fsync for that reason.
sync_file_range and friends
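Linux-only. sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE) starts writeback for only a byte range; with this flag alone it does not wait for the writeback to finish. PostgreSQL uses this for "lazy" checkpoints to avoid stalling on huge fsyncs. It doesn't sync metadata and gives no durability guarantee on its own, so you still need a final fsync.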
Pre-allocation & fallocate
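For predictable I/O, pre-allocate the next WAL segment with fallocate(FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE). This avoids block-allocation metadata updates on each grow and gives the FS a chance to hand out a contiguous extent. Note that FALLOC_FL_KEEP_SIZE leaves the reported file size unchanged, so appends past the old size still update the inode size; drop that flag if you want the segment's final size fixed up front. PostgreSQL pre-zeroes 16 MB segments.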
Direct I/O & alignment
O_DIRECT bypasses the page cache; useful when the DB has its own buffer pool. Requires buffers, file offsets, and transfer lengths aligned to the device's logical block size (typically 512 B or 4 KB). Modern recommendation: prefer io_uring + O_DIRECT over POSIX AIO. We return to this in db-21.
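The alignment requirement is the part that usually bites first. The sketch below shows one std-only way to get an aligned, length-rounded buffer by over-allocating a Vec<u8> and slicing at an aligned offset. `AlignedBuf` and `round_up` are hypothetical helpers for illustration; opening the file with O_DIRECT itself (which needs a platform-specific flag value) is not shown.

```rust
/// Round `n` up to the next multiple of `align` (`align` must be non-zero).
fn round_up(n: usize, align: usize) -> usize {
    (n + align - 1) / align * align
}

/// Hypothetical helper: a zeroed byte buffer whose usable slice starts at
/// an address that is a multiple of `align` and whose length is a multiple
/// of `align`, as direct I/O typically requires.
struct AlignedBuf {
    raw: Vec<u8>, // over-allocated backing storage
    start: usize, // offset of the first aligned byte inside `raw`
    len: usize,   // usable length, already rounded up to `align`
}

impl AlignedBuf {
    fn new(min_len: usize, align: usize) -> Self {
        assert!(align.is_power_of_two(), "alignment must be a power of two");
        let len = round_up(min_len, align);
        // `align` extra bytes guarantee an aligned window of `len` bytes exists.
        let raw = vec![0u8; len + align];
        // Number of bytes to skip so that the pointer becomes aligned.
        let start = raw.as_ptr().align_offset(align);
        AlignedBuf { raw, start, len }
    }

    fn as_mut_slice(&mut self) -> &mut [u8] {
        &mut self.raw[self.start..self.start + self.len]
    }
}
```

The heap allocation behind the Vec does not move when the struct is moved, so the computed offset stays valid as long as `raw` is never resized. File offsets need the same treatment: a WAL writer would round its write position with `round_up` and pad the tail of each write.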
Mixing data files and WAL on the same disk
Bad idea for HDDs (head contention), neutral for most SSDs (no head), bad for low-end SSDs (data-file and WAL writes compete for limited write bandwidth, and write amplification makes it worse). Production systems put WAL on a separate device when latency-sensitive.
When the WAL is the database
LSM-trees, Kafka, NATS JetStream, Pulsar, Apache BookKeeper — these treat the log as the primary structure and let secondary indexes / merge trees / consumers catch up. The data file in our toy example was hypothetical; LSMs make it explicit. See db-05 onward.
Encryption / compression
- Compression per record: trivial, but blocks the Vec<u8> reuse pattern. Better to compress whole segments at checkpoint time.
- Encryption per record: AEAD (AES-GCM or ChaCha20-Poly1305) replaces CRC32; the auth tag is your CRC. PostgreSQL's TDE proposals use this.
Replication
Once you have a sequential log of operations, replicating it is "just" send-and-replay. This is the entire conceptual basis of Raft and ZAB — see db-17 / db-19. The framing tricks here transfer directly.
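As a toy illustration of send-and-replay, the sketch below applies the same log to a leader and a follower and ends with identical state. The `Op` type, `apply`, and `replay` are hypothetical; a real protocol such as Raft adds terms, acknowledgements, and conflict handling on top, none of which is modelled here.

```rust
use std::collections::BTreeMap;

/// Hypothetical log entry type for a tiny key-value store.
#[derive(Debug, Clone)]
enum Op {
    Put(String, String),
    Delete(String),
}

/// Apply a single operation to the state machine.
fn apply(state: &mut BTreeMap<String, String>, op: &Op) {
    match op {
        Op::Put(k, v) => {
            state.insert(k.clone(), v.clone());
        }
        Op::Delete(k) => {
            state.remove(k);
        }
    }
}

/// Follower catch-up: apply every entry from index `applied` onward and
/// return the new applied index. Assumes `applied <= log.len()`.
fn replay(state: &mut BTreeMap<String, String>, log: &[Op], applied: usize) -> usize {
    for op in &log[applied..] {
        apply(state, op);
    }
    log.len()
}
```

Because `apply` is deterministic, any replica that replays the same prefix of the log reaches the same state, and a follower that falls behind only needs to remember its applied index to resume.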
What goes wrong at scale
- fsync amplification: every fsync touches the FS journal, which serializes against other fsyncs. Solution: large group commit batches.
- Long fsync tails: 99th-percentile fsync on a busy NVMe can be 100ms+. Solution: pipeline; never block a hot-path thread on fsync.
- Filesystems that lie: ext4 with data=writeback may complete fsync before journaling. APFS, ZFS, btrfs each have their own quirks. Empirical test with fio is the only safe answer.