db-09 — Broader Ideas
Where to take this engine next, and where it already touches the rest of distributed-systems engineering.
Immediate next labs
- db-10 — B-tree fundamentals. The "other half" of storage. LSMs optimize for write-heavy workloads with append-only files and amortized rewrites; B-trees optimize for in-place point updates and range scans. Both shapes appear in every production database (often side by side: Postgres heap + WAL is B-tree-like, with TOAST and rolled-back versions reaped by VACUUM; MySQL/InnoDB is B-tree primary + UNDO log).
- db-21 — Storage-engine advanced. Compaction, leveled compaction policy, block cache wiring, bloom-filter probing, snapshots, file garbage collection. Everything that "real" LevelDB/RocksDB has that we postponed in db-09.
How this lab's pieces show up in distributed systems
- MANIFEST as a tiny version-edit log is a microcosm of how distributed systems use a log to make state changes atomic. A Raft log is the same pattern at machine granularity: apply changes to a state machine only after they're durably appended to an append-only log.
renamefor atomic publish is the local-filesystem analogue of two-phase commit. The OS gives us a strong primitive (rename is atomic under crash) and we lean on it. Distributed systems have to build equivalent primitives (Paxos / Raft / 2PC) because no underlying layer provides them for free.- Newest-source-wins under a total order on writes is exactly how a CRDT LWW-register, a multi-version concurrency control snapshot read, a Kafka log-compaction "last value wins" topic, and a Bigtable per-cell-with-timestamp work. The variable that changes between systems is what defines "newer" (file id here; sequence number in LevelDB proper; timestamp in Cassandra; vector clock in Dynamo).
Performance experiments worth running later
These are not required for the lab to be green; they are good Saturday afternoon explorations:
- Plot read-amplification growth as L0 grows: write N batches, never flush, measure point-lookup latency vs N.
- Replace text MANIFEST with LevelDB-style version-edit log; measure flush latency improvement at large live-set counts.
- Add a block cache between
SstReader::Getand the file bytes; measure hit rate on a Zipfian workload. - Wire bloom filters (db-04) per SSTable; measure how many SSTs you can skip for a typical miss-heavy workload.
What "production-ready" would require beyond this lab
- Concurrent writers (a real
Mutexon the write path, multiple readers via versioned snapshots). - Group commit (batch many WAL appends behind one fsync).
- Direct I/O /
pwrite-based SST writer to avoid double-buffering. - Checksums on every block read, not only at SST footer level.
- A scheduler for background flush + compaction with admission control.
fsync(dir)after every file create / rename to survive metadata-loss scenarios on certain filesystems.
None of these change the shape of the engine — they make the same shape faster and tougher.