Broader Ideas — Beyond the Minimum

Things worth knowing that aren't in the lab code.

Block-grouped framing (LevelDB / RocksDB)

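LevelDB divides its log into 32 KB blocks and writes each record as one or more fragments. Every fragment carries a 7-byte header: a 4-byte checksum, a 2-byte length, and a 1-byte type (FULL, FIRST, MIDDLE, LAST) that handles records straddling blocks. A block tail too short to hold a header is zero-padded. Benefit: corruption in one block can't propagate, because recovery can resync at the next block boundary. Cost: more code and slightly more space.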
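The fragmenting rule can be sketched as follows. This is a minimal illustration, not LevelDB's code: `append_record`, `FragType`, and `toy_checksum` are hypothetical names, the checksum is a placeholder (LevelDB uses a masked CRC32C), and the block size is a parameter so a tiny value can make the splitting visible.

```rust
// Sketch of LevelDB-style fragment framing (illustrative, not LevelDB's code).
const HEADER_LEN: usize = 7; // checksum (4) + length (2) + type (1)

#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(u8)]
enum FragType {
    Full = 1,
    First = 2,
    Middle = 3,
    Last = 4,
}

/// Placeholder checksum over type + payload; a real log would use CRC32C.
fn toy_checksum(kind: u8, data: &[u8]) -> u32 {
    data.iter()
        .fold(kind as u32, |acc, &b| acc.wrapping_mul(31).wrapping_add(b as u32))
}

/// Appends `record` to `log` so that no fragment crosses a block boundary.
/// Returns the fragment types that were written, in order.
fn append_record(log: &mut Vec<u8>, record: &[u8], block_size: usize) -> Vec<FragType> {
    assert!(block_size > HEADER_LEN && block_size - HEADER_LEN <= u16::MAX as usize);
    let mut kinds = Vec::new();
    let mut rest = record;
    let mut first = true;
    loop {
        let left = block_size - log.len() % block_size;
        if left < HEADER_LEN {
            // Not enough room for a header: zero-pad to the block boundary.
            log.resize(log.len() + left, 0);
            continue;
        }
        let take = rest.len().min(left - HEADER_LEN);
        let (chunk, tail) = rest.split_at(take);
        let last = tail.is_empty();
        let kind = match (first, last) {
            (true, true) => FragType::Full,
            (true, false) => FragType::First,
            (false, true) => FragType::Last,
            (false, false) => FragType::Middle,
        };
        log.extend_from_slice(&toy_checksum(kind as u8, chunk).to_le_bytes());
        log.extend_from_slice(&(take as u16).to_le_bytes());
        log.push(kind as u8);
        log.extend_from_slice(chunk);
        kinds.push(kind);
        if last {
            return kinds;
        }
        rest = tail;
        first = false;
    }
}

fn main() {
    let mut log = Vec::new();
    // 32-byte blocks keep the example small; LevelDB uses 32 KB.
    let a = append_record(&mut log, &[1u8; 10], 32);
    let b = append_record(&mut log, &[2u8; 40], 32);
    println!("{a:?} {b:?} log_len={}", log.len());
}
```

A reader does the reverse: it walks block by block, verifies each fragment's checksum, and on a mismatch skips to the next block boundary and waits for the next FULL or FIRST fragment.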

Group commit, properly

Real systems run a "log writer" goroutine/thread:

clients ──► append to buffer ──► wake writer ──► fsync once ──► broadcast cond var

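The writer batches all records that arrived during the previous fsync into the next fsync. A record's commit latency is therefore bounded by roughly two fsync durations: the one already in flight when it arrived, plus the one that covers its own batch. Throughput scales with batch size until the device saturates.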
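A minimal sketch of that pipeline, using only the standard library. Everything here is illustrative: `Wal`, `State`, `commit`, `writer_loop`, and `run` are hypothetical names, the "disk" is an in-memory `Vec<u8>`, and a short sleep stands in for the fsync.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

#[derive(Default)]
struct State {
    pending: Vec<Vec<u8>>, // appended, not yet picked up by the writer
    appended: u64,         // sequence number of the newest appended record
    durable: u64,          // newest sequence number covered by a completed sync
    syncs: u64,            // how many (simulated) fsyncs ran
    shutdown: bool,
}

#[derive(Default)]
struct Wal {
    state: Mutex<State>,
    work: Condvar,    // wakes the writer
    flushed: Condvar, // broadcast to waiting clients after each sync
}

impl Wal {
    /// Client side: append, wake the writer, block until the record is durable.
    fn commit(&self, record: Vec<u8>) -> u64 {
        let mut st = self.state.lock().unwrap();
        st.appended += 1;
        let seq = st.appended;
        st.pending.push(record);
        self.work.notify_one();
        while st.durable < seq {
            st = self.flushed.wait(st).unwrap();
        }
        seq
    }

    /// Writer side: take everything that queued up, write it, sync once.
    fn writer_loop(&self, sink: &mut Vec<u8>) {
        loop {
            let (batch, upto) = {
                let mut st = self.state.lock().unwrap();
                while st.pending.is_empty() && !st.shutdown {
                    st = self.work.wait(st).unwrap();
                }
                if st.pending.is_empty() {
                    return; // shutdown requested and nothing left to flush
                }
                (std::mem::take(&mut st.pending), st.appended)
            };
            for record in &batch {
                sink.extend_from_slice(record);
            }
            thread::sleep(Duration::from_millis(2)); // stand-in for fsync
            let mut st = self.state.lock().unwrap();
            st.durable = upto;
            st.syncs += 1;
            self.flushed.notify_all();
        }
    }
}

/// Runs `clients` concurrent committers; returns (durable, syncs, bytes written).
fn run(clients: usize) -> (u64, u64, usize) {
    let wal = Arc::new(Wal::default());
    let w = Arc::clone(&wal);
    let writer = thread::spawn(move || {
        let mut sink = Vec::new();
        w.writer_loop(&mut sink);
        sink
    });
    let handles: Vec<_> = (0..clients)
        .map(|i| {
            let wal = Arc::clone(&wal);
            thread::spawn(move || wal.commit((i as u64).to_le_bytes().to_vec()))
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    wal.state.lock().unwrap().shutdown = true;
    wal.work.notify_one();
    let sink = writer.join().unwrap();
    let st = wal.state.lock().unwrap();
    (st.durable, st.syncs, sink.len())
}

fn main() {
    let (durable, syncs, bytes) = run(32);
    println!("{durable} records durable after {syncs} syncs ({bytes} bytes)");
}
```

The lock is released while the simulated fsync runs, so clients keep appending to `pending` in the meantime. Those records form the next batch, which is the whole point of the design.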

`O_DSYNC` vs application-level `fsync`

O_DSYNC makes every write() durable before it returns, as if each write() were followed by fdatasync(). That removes the need for an explicit fsync, but every write pays for its own flush, so you lose the chance to batch. Real DBs prefer explicit fsync for that reason.
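The batching that explicit syncing allows can be sketched with the standard library alone. `append_batch` is a hypothetical helper, and the temp-file path in `main` is only for demonstration; `File::sync_data` is the std call that corresponds to fdatasync on Linux.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Appends every record, then makes the whole batch durable with one sync.
/// Returns the number of payload bytes written.
fn append_batch(path: &Path, records: &[&[u8]]) -> io::Result<usize> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    let mut written = 0;
    for record in records {
        file.write_all(record)?; // buffered by the OS, not yet durable
        written += record.len();
    }
    file.sync_data()?; // one flush covers every record above
    Ok(written)
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join(format!("wal_batch_demo_{}.log", std::process::id()));
    let records: [&[u8]; 3] = [b"set a=1\n", b"set b=2\n", b"del a\n"];
    let n = append_batch(&path, &records)?;
    println!("wrote {n} bytes with a single sync");
    std::fs::remove_file(&path)
}
```

With O_DSYNC the same three writes would each wait for the device; here only the final `sync_data` does.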

`sync_file_range` and friends

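Linux-only. sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE) starts writeback for a byte range without waiting for it to complete. PostgreSQL uses it to push dirty pages out early during checkpoints so that the final fsync has less left to write and stalls less. It does not write file metadata and gives no durability guarantee on its own, so you still need a final fsync.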

Pre-allocation & fallocate

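For predictable I/O, pre-allocate the next WAL segment before it is needed. fallocate with mode 0 reserves the blocks and extends the file size up front, so later appends do not grow the file. (With FALLOC_FL_KEEP_SIZE, including FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE, blocks are reserved but the size still changes on every append.) Either way the FS gets the chance to hand out a contiguous extent. Preallocated extents are usually marked unwritten and converted on first write, which is itself a metadata update; PostgreSQL avoids that by filling each new 16 MB segment with real zeros.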
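The pre-zeroing variant needs no platform-specific calls, so it can be sketched with the standard library. `prezero_segment` is a hypothetical helper and the 64 KB chunk size is an arbitrary choice; a fallocate-based version would need the `libc` crate and is not shown.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Creates a new segment file of exactly `size` bytes filled with zeros,
/// then syncs it so both the data and the final length are on disk.
fn prezero_segment(path: &Path, size: u64) -> io::Result<()> {
    // create_new fails if the file exists, so a live segment is never clobbered.
    let mut file = OpenOptions::new().write(true).create_new(true).open(path)?;
    let zeros = [0u8; 64 * 1024];
    let mut remaining = size;
    while remaining > 0 {
        let n = remaining.min(zeros.len() as u64) as usize;
        file.write_all(&zeros[..n])?;
        remaining -= n as u64;
    }
    file.sync_all()
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join(format!("wal_segment_demo_{}.seg", std::process::id()));
    let _ = std::fs::remove_file(&path);
    prezero_segment(&path, 16 * 1024 * 1024)?; // 16 MB, the segment size mentioned above
    println!("segment is {} bytes", std::fs::metadata(&path)?.len());
    std::fs::remove_file(&path)
}
```

A production version would also fsync the parent directory so the new directory entry survives a crash, and would run this ahead of time rather than on the commit path.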

Direct I/O & alignment

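O_DIRECT bypasses the page cache, which is useful when the DB has its own buffer pool. It requires the buffer address, the file offset, and the transfer length to be aligned, typically to the device's logical block size (512 B or 4 KB). Modern recommendation on Linux: prefer io_uring + O_DIRECT over POSIX AIO. This topic returns in db-21.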
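The alignment requirement is the part that usually bites first, so here is a small sketch of an aligned buffer using only the standard allocator API. `AlignedBuf` and `align_up` are hypothetical names; actually opening a file with O_DIRECT needs a platform flag (for example via the `libc` crate) and is left out.

```rust
use std::alloc::{alloc_zeroed, dealloc, handle_alloc_error, Layout};

/// Rounds `n` up to the next multiple of `align` (a power of two).
fn align_up(n: usize, align: usize) -> usize {
    assert!(align.is_power_of_two());
    (n + align - 1) & !(align - 1)
}

/// A zeroed heap buffer whose address and length are both multiples of `align`.
struct AlignedBuf {
    ptr: *mut u8,
    layout: Layout,
}

impl AlignedBuf {
    fn new(min_len: usize, align: usize) -> Self {
        let len = align_up(min_len.max(1), align);
        let layout = Layout::from_size_align(len, align).expect("invalid layout");
        // SAFETY: `layout` has a non-zero size.
        let ptr = unsafe { alloc_zeroed(layout) };
        if ptr.is_null() {
            handle_alloc_error(layout);
        }
        AlignedBuf { ptr, layout }
    }

    fn len(&self) -> usize {
        self.layout.size()
    }

    fn as_ptr(&self) -> *const u8 {
        self.ptr
    }

    fn as_mut_slice(&mut self) -> &mut [u8] {
        // SAFETY: `ptr` points to `layout.size()` initialized bytes owned by `self`.
        unsafe { std::slice::from_raw_parts_mut(self.ptr, self.layout.size()) }
    }
}

impl Drop for AlignedBuf {
    fn drop(&mut self) {
        // SAFETY: `ptr` was allocated with exactly this layout.
        unsafe { dealloc(self.ptr, self.layout) }
    }
}

fn main() {
    let mut buf = AlignedBuf::new(5000, 4096);
    buf.as_mut_slice()[..5].copy_from_slice(b"hello");
    println!("len={} aligned={}", buf.len(), buf.as_ptr() as usize % 4096 == 0);
}
```

The same rounding applies to file offsets: a WAL written with O_DIRECT has to pad each write out to the block size, which is one reason block-oriented framing pairs well with it.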

Mixing data files and WAL on the same disk

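Bad idea on HDDs (the head seeks back and forth between the sequential WAL and the data files), roughly neutral on good SSDs (no head), and bad again on low-end SSDs (WAL flushes compete with data-file writes and the write amplification they cause). Latency-sensitive production systems put the WAL on a separate device.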

When the WAL is the database

LSM-trees, Kafka, NATS JetStream, Pulsar, Apache BookKeeper — these treat the log as the primary structure and let secondary indexes / merge trees / consumers catch up. The data file in our toy example was hypothetical; LSMs make it explicit. See db-05 onward.

Encryption / compression

Compression per record: trivial, but blocks the Vec<u8> reuse pattern. Better to compress whole segments at checkpoint time.
Encryption per record: AEAD (AES-GCM or ChaCha20-Poly1305) replaces CRC32 — the auth tag is your CRC. PostgreSQL's TDE proposals use this.

Replication

Once you have a sequential log of operations, replicating it is "just" send-and-replay. A replicated log is the core abstraction of Raft and ZAB; what they add is agreement on the log's contents and order when nodes fail (see db-17 / db-19). The framing tricks here transfer directly.
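Send-and-replay on its own fits in a few lines. The sketch below is deliberately naive and all names (`Op`, `Replica`, `replay`) are hypothetical: there is no network, no terms, and no quorum, only a follower that applies log entries in order and rejects batches that do not start where it left off.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
enum Op {
    Put(String, String),
    Delete(String),
}

#[derive(Default)]
struct Replica {
    applied: usize, // number of log entries applied so far
    kv: HashMap<String, String>,
}

impl Replica {
    /// Replays `entries`, which must start at log index `from`.
    /// A batch that does not line up is ignored. Returns the new applied index.
    fn replay(&mut self, from: usize, entries: &[Op]) -> usize {
        if from != self.applied {
            return self.applied; // gap or duplicate: sender should retry from `applied`
        }
        for op in entries {
            match op {
                Op::Put(k, v) => {
                    self.kv.insert(k.clone(), v.clone());
                }
                Op::Delete(k) => {
                    self.kv.remove(k);
                }
            }
            self.applied += 1;
        }
        self.applied
    }
}

fn main() {
    // The "leader" is just the log itself.
    let log = vec![
        Op::Put("a".into(), "1".into()),
        Op::Put("b".into(), "2".into()),
        Op::Delete("a".into()),
    ];
    let mut follower = Replica::default();
    let start = follower.applied;
    let applied = follower.replay(start, &log[start..]);
    println!("follower applied {applied} entries: {:?}", follower.kv);
}
```

Returning the applied index lets the sender learn where to resume, which is the same role the acknowledged offset plays in real log-shipping protocols.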

What goes wrong at scale

fsync amplification: every fsync touches the FS journal, which serializes against other fsyncs. Solution: large group commit batches.
Long fsync tails: 99th-percentile fsync on a busy NVMe can be 100ms+. Solution: pipeline; never block a hot-path thread on fsync.
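Filesystems that lie: a returned fsync does not always mean the data is on stable media. On macOS, fsync does not flush the drive's write cache; that takes fcntl(F_FULLFSYNC). On Linux, a filesystem mounted with write barriers disabled, or a drive that acknowledges flushes from a volatile cache, has the same effect. ext4 with data=writeback does not order data writes against metadata commits, so files that were not fsynced can show stale data after a crash. APFS, ZFS, btrfs each have their own quirks. Testing empirically on the target stack (fio for fsync latency, power-cut tests for durability) is the only safe answer.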

Distributed Systems Engineer — Build Databases & Consensus From Scratch