LSM MemTable
Lab: db-05 — the in-memory write buffer of an LSM-tree.
1. What Is It
A MemTable is the in-memory, sorted write buffer at the top of every Log-Structured
Merge tree (LSM). All writes — put, delete, range updates — land in the MemTable
first, indexed by key, and only later get flushed to immutable on-disk SSTables
(see db-06). It is paired with a Write-Ahead Log (db-03) for durability: WAL gives
crash recovery; the MemTable gives fast point and range lookups.
This lab implements a deterministic, byte-identical MemTable across Rust, Go, and C++ that can be serialized to disk and read back in any of the three languages.
2. Why It Matters
- Write throughput. Writes touch only RAM (plus a single sequential WAL append).
Random
puts become sequential disk traffic. - Read recency. The MemTable is the freshest copy of any key; a get must consult it first before falling through to L0..Ln SSTables.
- Flush boundary. Once the MemTable hits its size cap (
write_buffer_sizein LevelDB/RocksDB), it freezes, a new MemTable rotates in, and the frozen one is written sequentially to an SSTable on background threads. - Tombstones. Deletes are inserts of tombstone records; the MemTable must preserve them through flush so older SSTables can be shadowed.
3. How It Works
writes reads
│ │
▼ ▼
┌──────── MemTable (active) ─────────┐ point/range
│ sorted map: key → (type, value) │◄──────────┐
│ tombstones live alongside values │ │
└──────────────────┬─────────────────┘ │
size > cap? │ │
▼ │
┌── Immutable MemTable (frozen) ─┐ │
│ flushing in the background │◄───────────┤
└──────────────────┬──────────────┘ │
▼ │
SSTable on disk ─────────────────►┘
(db-06 format)
Internally the MemTable is a sorted associative container with byte-lexicographic key ordering:
- Rust:
BTreeMap<Vec<u8>, Entry>(Vec<u8>'sOrdis lex over bytes). - Go:
map[string]Entry+ key slice sorted on dump/iteration. - C++:
std::map<std::vector<uint8_t>, Entry>(operator< on vectors is lex).
Production LSMs (LevelDB, RocksDB) use a skip list because it offers concurrent lock-free reads and allocations from an arena. For this lab the simpler tree is fine — what matters is the order-determinism and the on-disk byte layout.
4. Core Terminology
| Term | Definition |
|---|---|
| MemTable | Sorted in-memory map of keys to values/tombstones; the LSM write buffer. |
| Immutable MemTable | A frozen MemTable, no longer accepting writes, awaiting flush. |
| Tombstone | A delete marker stored as an entry of type Delete; needed because older SSTables may still hold the key. |
| Skip list | Randomized layered linked-list giving expected O(log n) insert/lookup; LevelDB/RocksDB's choice. |
| Flush | Writing a frozen MemTable out as an SSTable. |
| Sequence number | Monotonically increasing version tag attached to each write so readers can pick the right snapshot. |
| Arena | Bump allocator that backs MemTable nodes; freed in one go when the table is dropped. |
5. Mental Models
- Three-layer journal. WAL = durability log. MemTable = sorted index over the WAL's recent tail. SSTable = compacted, immutable snapshot. The MemTable is the short-term, queryable face of the WAL.
- Latest write wins. For a single point lookup the MemTable always shadows any on-disk data; a tombstone in the MemTable hides every prior value of that key.
- Flush is amplification's first knob. Larger MemTables → fewer, bigger L0 SSTables → less compaction work but more recovery time and RAM. Production tunes this between 16 MiB and 256 MiB.
- Why sorted? Because the flush writes the SSTable in a single sequential pass — no on-disk sort needed if the MemTable is already ordered.
6. Common Misconceptions
- "The MemTable is the WAL." No. The WAL is unsorted, append-only, and may contain redundant updates for the same key. The MemTable is sorted and deduplicated.
- "Tombstones can be GC'd in the MemTable." No — they must be flushed; only after compaction confirms no older SSTable holds the key can a tombstone be dropped.
- "You can skip the WAL if writes are batched." The MemTable lives in RAM. Without the WAL a crash loses every unflushed write.
- "Skip list is the only valid structure." A B-tree, ART, or sorted vector with occasional rebuild are all viable; skip list wins for the specific concurrency pattern of one writer + many readers.
7. Interview Talking Points
- Explain why an LSM uses a MemTable + WAL instead of writing directly to a sorted on-disk file (random I/O kills throughput).
- Walk through the lifecycle:
put→ WAL append → MemTable insert → eventually frozen → flushed → compacted. - Describe how a get traverses MemTable → immutable MemTable → L0 SSTables → Ln, stopping at the first match (value or tombstone).
- Cost of tombstones: read amplification grows because we cannot skip a level just because we found nothing; we might still find a tombstone later.
- Why a MemTable's flush is a sorted sequential write — and why this is the primary trick that makes LSMs faster than B-trees for write-heavy workloads.
8. Connections to Other Labs
- db-03 (WAL): every MemTable write is preceded by a WAL append; the WAL is replayed into a fresh MemTable on startup.
- db-04 (Bloom filters): SSTables produced by MemTable flush carry Bloom filters for negative lookups.
- db-06 (SSTable format): the flush target; this lab's
flush_tois the producer side of db-06'sopen. - db-07 (compaction): consumes SSTables that came from MemTable flushes.
- db-09 (LevelDB complete): stitches all of the above into a working KV store.