LSM MemTable

Lab: db-05 — the in-memory write buffer of an LSM-tree.

1. What Is It

A MemTable is the in-memory, sorted write buffer at the top of every Log-Structured Merge tree (LSM). All writes — put, delete, range updates — land in the MemTable first, indexed by key, and only later get flushed to immutable on-disk SSTables (see db-06). It is paired with a Write-Ahead Log (db-03) for durability: WAL gives crash recovery; the MemTable gives fast point and range lookups.

This lab implements a deterministic, byte-identical MemTable across Rust, Go, and C++ that can be serialized to disk and read back in any of the three languages.

2. Why It Matters

  • Write throughput. Writes touch only RAM (plus a single sequential WAL append). Random puts become sequential disk traffic.
  • Read recency. The MemTable is the freshest copy of any key; a get must consult it first before falling through to L0..Ln SSTables.
  • Flush boundary. Once the MemTable hits its size cap (write_buffer_size in LevelDB/RocksDB), it freezes, a new MemTable rotates in, and the frozen one is written sequentially to an SSTable on background threads.
  • Tombstones. Deletes are inserts of tombstone records; the MemTable must preserve them through flush so older SSTables can be shadowed.

3. How It Works

                writes                      reads
                  │                           │
                  ▼                           ▼
   ┌──────── MemTable (active) ─────────┐  point/range
   │   sorted map: key → (type, value)  │◄──────────┐
   │   tombstones live alongside values │           │
   └──────────────────┬─────────────────┘           │
        size > cap?   │                              │
                      ▼                              │
       ┌── Immutable MemTable (frozen) ─┐            │
       │   flushing in the background    │◄───────────┤
       └──────────────────┬──────────────┘           │
                          ▼                          │
                   SSTable on disk ─────────────────►┘
                   (db-06 format)

Internally the MemTable is a sorted associative container with byte-lexicographic key ordering:

  • Rust: BTreeMap<Vec<u8>, Entry> (Vec<u8>'s Ord is lex over bytes).
  • Go: map[string]Entry + key slice sorted on dump/iteration.
  • C++: std::map<std::vector<uint8_t>, Entry> (operator< on vectors is lex).

Production LSMs (LevelDB, RocksDB) use a skip list because it offers concurrent lock-free reads and allocations from an arena. For this lab the simpler tree is fine — what matters is the order-determinism and the on-disk byte layout.

4. Core Terminology

TermDefinition
MemTableSorted in-memory map of keys to values/tombstones; the LSM write buffer.
Immutable MemTableA frozen MemTable, no longer accepting writes, awaiting flush.
TombstoneA delete marker stored as an entry of type Delete; needed because older SSTables may still hold the key.
Skip listRandomized layered linked-list giving expected O(log n) insert/lookup; LevelDB/RocksDB's choice.
FlushWriting a frozen MemTable out as an SSTable.
Sequence numberMonotonically increasing version tag attached to each write so readers can pick the right snapshot.
ArenaBump allocator that backs MemTable nodes; freed in one go when the table is dropped.

5. Mental Models

  • Three-layer journal. WAL = durability log. MemTable = sorted index over the WAL's recent tail. SSTable = compacted, immutable snapshot. The MemTable is the short-term, queryable face of the WAL.
  • Latest write wins. For a single point lookup the MemTable always shadows any on-disk data; a tombstone in the MemTable hides every prior value of that key.
  • Flush is amplification's first knob. Larger MemTables → fewer, bigger L0 SSTables → less compaction work but more recovery time and RAM. Production tunes this between 16 MiB and 256 MiB.
  • Why sorted? Because the flush writes the SSTable in a single sequential pass — no on-disk sort needed if the MemTable is already ordered.

6. Common Misconceptions

  • "The MemTable is the WAL." No. The WAL is unsorted, append-only, and may contain redundant updates for the same key. The MemTable is sorted and deduplicated.
  • "Tombstones can be GC'd in the MemTable." No — they must be flushed; only after compaction confirms no older SSTable holds the key can a tombstone be dropped.
  • "You can skip the WAL if writes are batched." The MemTable lives in RAM. Without the WAL a crash loses every unflushed write.
  • "Skip list is the only valid structure." A B-tree, ART, or sorted vector with occasional rebuild are all viable; skip list wins for the specific concurrency pattern of one writer + many readers.

7. Interview Talking Points

  • Explain why an LSM uses a MemTable + WAL instead of writing directly to a sorted on-disk file (random I/O kills throughput).
  • Walk through the lifecycle: put → WAL append → MemTable insert → eventually frozen → flushed → compacted.
  • Describe how a get traverses MemTable → immutable MemTable → L0 SSTables → Ln, stopping at the first match (value or tombstone).
  • Cost of tombstones: read amplification grows because we cannot skip a level just because we found nothing; we might still find a tombstone later.
  • Why a MemTable's flush is a sorted sequential write — and why this is the primary trick that makes LSMs faster than B-trees for write-heavy workloads.

8. Connections to Other Labs

  • db-03 (WAL): every MemTable write is preceded by a WAL append; the WAL is replayed into a fresh MemTable on startup.
  • db-04 (Bloom filters): SSTables produced by MemTable flush carry Bloom filters for negative lookups.
  • db-06 (SSTable format): the flush target; this lab's flush_to is the producer side of db-06's open.
  • db-07 (compaction): consumes SSTables that came from MemTable flushes.
  • db-09 (LevelDB complete): stitches all of the above into a working KV store.