db-09 — Broader Ideas

Where to take this engine next, and where it already touches the rest of distributed-systems engineering.

Immediate next labs

db-10 — B-tree fundamentals. The "other half" of storage. LSMs optimize for write-heavy workloads with append-only files and amortized rewrites; B-trees optimize for in-place point updates and range scans. Both shapes appear in every production database (often side by side: Postgres heap + WAL is B-tree-like, with TOAST and rolled-back versions reaped by VACUUM; MySQL/InnoDB is B-tree primary + UNDO log).
db-21 — Storage-engine advanced. Compaction, leveled compaction policy, block cache wiring, bloom-filter probing, snapshots, file garbage collection. Everything that "real" LevelDB/RocksDB has that we postponed in db-09.

MANIFEST as a tiny version-edit log is a microcosm of how distributed systems use a log to make state changes atomic. A Raft log is the same pattern at machine granularity: apply changes to a state machine only after they're durably appended to an append-only log.
rename for atomic publish is the local-filesystem analogue of two-phase commit. The OS gives us a strong primitive (rename is atomic under crash) and we lean on it. Distributed systems have to build equivalent primitives (Paxos / Raft / 2PC) because no underlying layer provides them for free.
Newest-source-wins under a total order on writes is exactly how a CRDT LWW-register, a multi-version concurrency control snapshot read, a Kafka log-compaction "last value wins" topic, and a Bigtable per-cell-with-timestamp work. The variable that changes between systems is what defines "newer" (file id here; sequence number in LevelDB proper; timestamp in Cassandra; vector clock in Dynamo).

These are not required for the lab to be green; they are good Saturday afternoon explorations:

Plot read-amplification growth as L0 grows: write N batches, never flush, measure point-lookup latency vs N.
Replace text MANIFEST with LevelDB-style version-edit log; measure flush latency improvement at large live-set counts.
Add a block cache between SstReader::Get and the file bytes; measure hit rate on a Zipfian workload.
Wire bloom filters (db-04) per SSTable; measure how many SSTs you can skip for a typical miss-heavy workload.

Concurrent writers (a real Mutex on the write path, multiple readers via versioned snapshots).
Group commit (batch many WAL appends behind one fsync).
Direct I/O / pwrite-based SST writer to avoid double-buffering.
Checksums on every block read, not only at SST footer level.
A scheduler for background flush + compaction with admission control.
fsync(dir) after every file create / rename to survive metadata-loss scenarios on certain filesystems.

None of these change the shape of the engine — they make the same shape faster and tougher.