Write-Ahead Log (WAL)
Status: complete — runnable in Rust, Go, C++.
1. What Is It
A WAL is an append-only file that records intent-to-modify before the actual data pages are updated. On crash, the recovery routine replays the WAL from the last checkpoint, restoring the database to a consistent state.
client write ──► append record ──► fsync(WAL) ──► ack client
│
▼
later: apply to data file
later: checkpoint & truncate WAL
The critical invariant is the write order: WAL hits stable storage before the client is told the write committed. If the process dies between the fsync and the data-page update, recovery re-applies the logged operation. If it dies before the fsync, the client never got an ack, so losing the record is acceptable.
2. Why It Matters
| Without WAL | With WAL |
|---|---|
| Random in-place writes to the data file | Sequential append (10–100× faster on HDD, still better on SSD) |
| Each commit = random-page fsync | Each commit = single sequential append + fsync |
| Crash mid-update ⇒ torn page, corrupt file | Crash ⇒ replay log, idempotent recovery |
| Group commit impossible | Multiple commits batched into one fsync ("group commit") |
Every serious database has one: PostgreSQL's WAL, MySQL's redo log, SQLite's WAL mode, RocksDB's WAL, LevelDB's LOG file, Kafka's segments.
3. How It Works
Record framing used in this lab (mirrors LevelDB's simplified record format minus the block-grouping):
┌─────────┬─────────┬──────────────────────┐
│ len(u32)│ crc(u32)│ payload (len bytes) │
└─────────┴─────────┴──────────────────────┘
4 4 N
lenandcrcare little-endian.crcis CRC-32 (IEEE 802.3 polynomial, reflected) of the payload only.- Records are written back-to-back with no padding.
- Recovery iterates from offset 0 and stops at the first record whose header is short, whose payload is short, or whose CRC fails. The valid prefix is replayed; the bad tail is silently truncated on next open.
file:
┌───┬───┬─────┬───┬───┬─────┬───┬───┬──┐ ← crashed mid-write
│L=8│CRC│ A… │L=4│CRC│ B… │L=9│CRC│??│
└───┴───┴─────┴───┴───┴─────┴───┴───┴──┘
▲ ▲
│ └─ short payload → stop, truncate from here
└─ last fully-flushed record
4. Core Terminology
| Term | Definition |
|---|---|
| Record | Self-describing unit: header (len + crc) followed by payload bytes. |
fsync | Syscall asking the kernel to flush dirty pages and inode metadata to disk. |
fdatasync | Like fsync but skips metadata if only data changed. Slightly faster. |
| Group commit | Coalescing multiple in-flight appends into one shared fsync. |
| Torn write | A write the device split into two physical sectors, only one of which made it. CRC catches this. |
| Tail truncation | Scanning forward at open and discarding any partial trailing record. |
| Checkpoint | Flush dirty pages to data file, record a WAL position beyond which replay is unnecessary. |
5. Mental Models
- The log is the source of truth, the data file is the cache. Recovery reconstructs from the log.
- CRC is for detection, not correction. It tells you where the good prefix ends; it does not heal damage.
fsyncis a barrier, not a universal durability guarantee. Consumer SSDs and FUSE filesystems sometimes lie. Usefio --fdatasync=1to spot-check hardware.- Sequential I/O wins. Even on SSDs, sequential writes have better SLC-cache and GC behavior than random ones.
6. Common Misconceptions
- "
write()already put it on disk." No — kernel page cache.fsyncis required for durability. - "CRC + length is enough." Necessary, not sufficient. A record with both
lenandcrczeroed is indistinguishable fromlen=0,crc=crc32([]). We disallowlen=0(treat as EOF). - "Group commit hurts latency." Tiny median bump for a 10–100× throughput win and lower tail latency under load.
- "fsync == O_DIRECT." Different layers.
O_DIRECTbypasses the page cache;fsyncflushes it.
7. Interview Talking Points
- Distinguish redo log (PostgreSQL/InnoDB), WAL mode (SQLite WAL), journal mode (SQLite default).
- Why CRC the payload, not the header? (Need length first; header CRC catches the wrong failures.)
fsyncvsfdatasyncvssync_file_range.- Group commit mechanics and tradeoff.
- Why WAL alone doesn't beat torn writes on the data file → full-page writes in WAL after each checkpoint.