Write-Ahead Log (WAL)

Status: complete — runnable in Rust, Go, C++.

1. What Is It

A WAL is an append-only file that records intent-to-modify before the actual data pages are updated. On crash, the recovery routine replays the WAL from the last checkpoint, restoring the database to a consistent state.

client write  ──►  append record  ──►  fsync(WAL)  ──►  ack client
                                                          │
                                                          ▼
                                                  later: apply to data file
                                                  later: checkpoint & truncate WAL

The critical invariant is the write order: WAL hits stable storage before the client is told the write committed. If the process dies between the fsync and the data-page update, recovery re-applies the logged operation. If it dies before the fsync, the client never got an ack, so losing the record is acceptable.

2. Why It Matters

Without WAL	With WAL
Random in-place writes to the data file	Sequential append (10–100× faster on HDD, still better on SSD)
Each commit = random-page fsync	Each commit = single sequential append + fsync
Crash mid-update ⇒ torn page, corrupt file	Crash ⇒ replay log, idempotent recovery
Group commit impossible	Multiple commits batched into one fsync ("group commit")

Every serious database has one: PostgreSQL's WAL, MySQL's redo log, SQLite's WAL mode, RocksDB's WAL, LevelDB's LOG file, Kafka's segments.

3. How It Works

Record framing used in this lab (mirrors LevelDB's simplified record format minus the block-grouping):

 ┌─────────┬─────────┬──────────────────────┐
 │ len(u32)│ crc(u32)│ payload (len bytes)  │
 └─────────┴─────────┴──────────────────────┘
       4         4              N

len and crc are little-endian.
crc is CRC-32 (IEEE 802.3 polynomial, reflected) of the payload only.
Records are written back-to-back with no padding.
Recovery iterates from offset 0 and stops at the first record whose header is short, whose payload is short, or whose CRC fails. The valid prefix is replayed; the bad tail is silently truncated on next open.

file:
   ┌───┬───┬─────┬───┬───┬─────┬───┬───┬──┐  ← crashed mid-write
   │L=8│CRC│ A…  │L=4│CRC│ B…  │L=9│CRC│??│
   └───┴───┴─────┴───┴───┴─────┴───┴───┴──┘
                                ▲          ▲
                                │          └─ short payload → stop, truncate from here
                                └─ last fully-flushed record

4. Core Terminology

Term	Definition
Record	Self-describing unit: header (len + crc) followed by payload bytes.
`fsync`	Syscall asking the kernel to flush dirty pages and inode metadata to disk.
`fdatasync`	Like `fsync` but skips metadata if only data changed. Slightly faster.
Group commit	Coalescing multiple in-flight appends into one shared `fsync`.
Torn write	A write the device split into two physical sectors, only one of which made it. CRC catches this.
Tail truncation	Scanning forward at open and discarding any partial trailing record.
Checkpoint	Flush dirty pages to data file, record a WAL position beyond which replay is unnecessary.

5. Mental Models

The log is the source of truth, the data file is the cache. Recovery reconstructs from the log.
CRC is for detection, not correction. It tells you where the good prefix ends; it does not heal damage.
fsync is a barrier, not a universal durability guarantee. Consumer SSDs and FUSE filesystems sometimes lie. Use fio --fdatasync=1 to spot-check hardware.
Sequential I/O wins. Even on SSDs, sequential writes have better SLC-cache and GC behavior than random ones.

6. Common Misconceptions

"write() already put it on disk." No — kernel page cache. fsync is required for durability.
"CRC + length is enough." Necessary, not sufficient. A record with both len and crc zeroed is indistinguishable from len=0,crc=crc32([]). We disallow len=0 (treat as EOF).
"Group commit hurts latency." Tiny median bump for a 10–100× throughput win and lower tail latency under load.
"fsync == O_DIRECT." Different layers. O_DIRECT bypasses the page cache; fsync flushes it.

7. Interview Talking Points

Distinguish redo log (PostgreSQL/InnoDB), WAL mode (SQLite WAL), journal mode (SQLite default).
Why CRC the payload, not the header? (Need length first; header CRC catches the wrong failures.)
fsync vs fdatasync vs sync_file_range.
Group commit mechanics and tradeoff.
Why WAL alone doesn't beat torn writes on the data file → full-page writes in WAL after each checkpoint.

8. Connections to Other Labs

db-01 — every append here is a pwrite + fsync from db-01.
db-05, db-09 — LSM writes always hit the WAL first.
db-11 — SQLite WAL mode reuses this exact pattern around a B-tree pager.
db-13 — commit records & 2PC live in the WAL.
db-17 — Raft's replicated log is, mechanically, a WAL.

Distributed Systems Engineer — Build Databases & Consensus From Scratch