Analysis — Write-Ahead Log

Problem statement

Make a stream of writes durable and crash-recoverable without paying for a random in-place fsync per write.

Constraints

ConstraintWhy it matters
Append-only fileSequential I/O, ~10–100× faster than random on HDD; better SLC/GC behavior on SSD.
Self-describing recordsRecovery must work without a side index. Header = length + checksum.
Truncation-tolerant tailCrash mid-write leaves a partial record. We must detect & ignore it on next open.
Single writerWe do not address multi-writer log multiplexing here (Kafka does).
No structural guarantees from the FSDon't assume ordering of metadata vs data, or that 4KB writes are atomic.

Design decisions

  1. Header = len(u32 LE) || crc32(u32 LE). Small (8 B), aligned, endian-fixed. We deliberately keep the CRC out of the length so we can stream-checksum the body.
  2. CRC is over the payload only. The header itself is implicitly validated by use — a corrupt len either points beyond EOF (short read) or to data whose CRC won't match.
  3. len == 0 is disallowed, used as the EOF sentinel. Empty payloads are rare in practice and avoiding the ambiguity simplifies the reader (len=0,crc=0 happens naturally in a hole of zeros from a sparse file or pre-allocated extent).
  4. Little-endian on disk. Everyone runs LE now (x86, ARM, RISC-V — even POWER prefers it). No htole32 dance saves ~5 LOC per language.
  5. CRC table generated at startup, not hardcoded. 1 KB, computed in microseconds. Easier to audit, and lets us swap polynomials in tests.
  6. One file, one writer, one fd. No segment rotation in this lab — that lives in db-07 (compaction) and db-09 (LevelDB). Single-file WAL is enough to teach framing.
  7. sync() is a separate method. The caller decides commit boundaries. Production systems may add append_sync(payload) that batches a group commit; we leave that for bench mode.

Why this design over alternatives

  • vs LevelDB's block-grouped framing: LevelDB pads records to 32KB blocks for alignment and easier corruption isolation. Beautiful, but doubles the code volume. We follow this lab's bias of "minimum to teach the concept, plus one cross-language cross-check".
  • vs JSON / protobuf framing: would require schema management. CRC + raw bytes is the smallest possible recoverable framing.
  • vs a per-record fsync: we expose a separate sync() so the user can choose between durability per-record (call after every append) and group commit (call periodically).

Failure modes addressed

FailureDetection
Partial header at EOFHeader read < 8 bytes ⇒ stop iteration.
Header OK, partial payloadPayload read < len ⇒ stop iteration.
Full record, CRC mismatch (bit-flip)CRC32 over payload ≠ stored CRC ⇒ stop iteration.
Hole of zeros (sparse FS, preallocated)len == 0 is the EOF sentinel ⇒ stop iteration cleanly.
Disk fully lying about fsyncOut of scope; mention fio --fdatasync=1 to detect.

Failure modes NOT addressed in this lab

  • Bit-flip in the header itself that produces a plausible (len, crc) pair (probability ≈ 2⁻³²). Production systems mitigate with a record-type byte (LevelDB) or magic bytes (Kafka).
  • Multi-process writers. Use O_APPEND + ≤PIPE_BUF append for that; see db-09 / db-21.
  • Disk full mid-write. Treat as torn write at EOF (the trailing record fails CRC and is dropped on recovery); the caller's append() returns an I/O error that they must handle.