db-08 — Block Cache and Iterators

What is it?

Two small, foundational read-path components that every LSM (and most B-tree engines) need:

Block cache — a bounded, in-memory map from (file_id, block_offset) to the decoded block bytes, evicting the least-recently-used entry when full. Sits between the SSTable reader and the OS page cache so that a hot index block or hot data block does not have to be decoded on every query.
Merging iterator — a streaming K-way merge over N pre-sorted sources (memtable, level-0 SSTables, level-1 SSTables, …) that yields each key exactly once, preferring the newest source on ties, and optionally drops tombstones. This is the engine of every LSM read: point lookups, range scans, compaction, and snapshot iteration.

In an LSM, a single user get("k") may have to consult the memtable plus 1–10 SSTables. Without a cache, every miss re-reads (and re-checksums, and re-decodes) blocks from disk; without a merging iterator, range scans cannot present a single ordered view of the live keyspace. Together these two components turn the LSM's "many small sorted runs" representation into the illusion of "one big sorted map" — and they do it without unbounded memory.

These primitives also appear far outside databases:

OS page cache is a block cache for files.
CPU L1/L2/L3 are hardware block caches keyed on physical address.
sort -m and most stream-join operators are merging iterators.
Kafka log compaction, Bigtable scans, and DynamoDB streams all do tournament-style merges across sorted inputs.

How does it work?

            ┌─────────────── BlockCache (cap = N entries) ───────────────┐
get(k)  ──► │  HashMap<(file_id, off), Node*>  +  DoublyLinkedList<Node> │
            │  hit:  splice node to front, return value                  │
            │  miss: insert at front; if full, drop the back node        │
            └────────────────────────────────────────────────────────────┘
                              │
                              ▼
            ┌─────────────── MergingIterator(sources) ───────────────────┐
            │  min-heap of (current_key, src_idx)                        │
            │  Next():                                                   │
            │    pop heap → winner                                       │
            │    advance winner src, push its next key (if any)          │
            │    while heap.top().key == winner.key:                     │
            │       pop & advance older (they are shadowed by winner)    │
            │    if drop_tombstones and winner is tombstone: continue    │
            │    yield (winner.key, winner.entry)                        │
            └────────────────────────────────────────────────────────────┘

Two invariants make this correct:

Per-source sort. Within one source, keys are strictly ascending. The heap therefore needs only the front of each source — never the full set.
Tie-break by source index. Source 0 is newest; on a tie, the newest entry wins and the older copies are drained without being yielded. This is how a put in the memtable shadows an old value in L1, and how a tombstone shadows a value of the same key in any older source.

Terminology

Block — a fixed-ish-size chunk of an SSTable (typically 4 KiB) that is the unit of disk I/O and the unit of block-cache eviction.
Cache hit / miss — was the requested key present in the cache?
Eviction — removing an entry to make room. LRU picks the entry least-recently touched (read or written).
MRU / LRU — most/least recently used end of the list.
K-way merge — merging K already-sorted sequences into one sorted sequence. Optimal comparison cost is O(N log K) for N total entries.
Tournament tree / min-heap — the data structure used to pick the next source to advance in O(log K).
Tombstone — a marker that says "this key has been deleted"; it shadows any older value for the same key until it is dropped during compaction.
Newest-wins — the LSM tie-break rule: source i < j means i is newer.

Mental models

The cache is a bounded hash map with a freshness order. The hash gives you O(1) lookup, the list gives you O(1) eviction of the stalest entry.
The merging iterator is a tournament. K runners, each in their own lane; the heap is the leaderboard; every Next() advances the current leader by one step and re-runs the comparison between the new front of that lane and the rest of the heap.
Tombstones are entries, not absences. They occupy a slot in the stream and only disappear during a compaction that is guaranteed to have seen all older versions of the same key.

Common misconceptions

"LRU is just a list." No — a list alone is O(N) per lookup. The hash map is what makes both operations O(1); the list only encodes the order.
"A merging iterator deduplicates by buffering everything." No. It inspects only the front of each source. Total memory is O(K), regardless of how many entries flow through.
"Newest-wins requires timestamps." Not in an LSM: source ordering already encodes recency (memtable > L0 > L1 > …). Timestamps are a separate concern for MVCC (db-13).
"A block cache replaces the OS page cache." It complements it. The OS caches raw file bytes; the block cache caches decoded/decompressed blocks and shortcuts the verification step (CRC checks, decompression).
"Tombstones can be dropped any time." Only during a compaction that includes the bottom level — otherwise an older live value could re-surface. See db-07 for the rule; db-08 lets the caller decide via a flag.

Talking points (interview-grade)

Why bound the cache by entries vs bytes? Entry-bounded is simpler and fine when blocks are roughly uniform (e.g., RocksDB's default 4 KiB blocks). Production systems bound by bytes (block_cache_size_mb) because compressed block sizes vary widely; we use entry count here to keep the data structure the focus of the lab.
Why a doubly-linked list and not a VecDeque? O(1) removal of an arbitrary node on hit-promotion. VecDeque only gives O(1) at the ends.
Why heap of (key, src) and not heap of full entries? Comparator cost: keys are small and comparable; entries (which may hold large values) are not. Also lets us move the entry out of the source vector with a single std::move / mem::take, avoiding copies.
Why does newest-wins also drain all older entries with that key? Otherwise the iterator would emit duplicates downstream, breaking the "exactly one entry per live key" contract that compaction and range scans depend on.
What about thread safety? Our block cache is single-writer-single-reader by design. Real systems use sharded caches (RocksDB: 64 shards) so each shard has its own mutex and contention is 1/Nth.

Connections to the rest of the curriculum

db-05 (memtable) is the newest source in every read-path merge.
db-06 (SSTable format) produces the sorted entries the merger consumes, and the blocks the cache caches.
db-07 (compaction) is itself a merging iterator with drop_tombstones=true whose output is fed to an SSTable writer. db-08 generalizes that machinery so the read path can use it for point lookups and scans as well.
db-09 (LevelDB-complete) wires this into a full Get/Scan path.
db-13 (MVCC) layers per-key snapshot filtering on top of a merging iterator like this one.

Distributed Systems Engineer — Build Databases & Consensus From Scratch